Abstract

The  U.S.  Air  Force  is  researching  the  fusion  of  multiple  sensors  and  classifiers.  Given  a  finite 
collection of classifiers to be fused one seeks a new classifier with improved performance. An established
performance quantifier is the Receiver Operating Characteristic (ROC) curve, which allows
one  to  view  the  probability  of  detection  versus  the  probability  of  false  alarm  in  one  graph.  Previous 
research  shows  that  one  does  not  have  to  perform  tests  to  determine  the  ROC  curve  of  this  new 
fused  classifier.  If  the  ROC  curve  for  each  individual  classifier  has  been  determined,  then  formulas 
for the ROC curve of the fused classifier exist for certain fusion rules. This will be an enormous saving
in time and money since the performance of many fused classifiers can be determined without
having  to  perform  tests  on  each  one. 

In  reality  only  finite  data  is  available  so  only  an  estimated  ROC  curve  can  be  constructed.  It 
has  been  proven  that  estimated  ROC  curves  will  converge  to  the  true  ROC  curve  in  probability. 
This  research  examines  if  convergence  is  preserved  when  these  estimated  ROC  curves  are  fused.  It 
provides  a  general  result  for  fusion  rules  that  are  governed  by  a  Lipschitz  continuous  ROC  fusion 
function  and  establishes  a  metric  that  can  be  used  to  prove  this  convergence.  This  framework  is 
then  applied  to  the  OR  fusion  rule,  as  well  as  an  example  study.  The  study  examines  two  ROC 
curves,  estimated  both  parametrically  and  non-parametrically,  fused  with  the  OR  rule. 

CONSISTENCY RESULTS FOR THE ROC CURVES OF FUSED CLASSIFIERS

I.  Introduction 


1.1  General  Discussion 

Combat identification (C-ID), the ability to detect and classify friend versus foe, has moved to
the forefront of technological challenges facing the U.S. military today. Even while the military
superiority of coalition forces has been overwhelming in recent conflicts, and losses have been relatively
low, an increasing percentage of losses has been due to friendly fire. The U.S. military has
invested a large amount of resources to solve this problem. The goals are a reduced, ideally zero,
fratricide rate and increased lethality. To that end, new sensors are being developed and fielded that
can more effectively exploit target signature data to more confidently make the determination of
hostile or friendly. Building hardware capable of extracting this signature information is one part
of the solution; the other part is properly classifying it.

Many  different  classification  techniques  have  been  developed  over  the  years.  One  popular 
technique  is  Fisher’s  linear  discriminant,  where  multi-dimensional  data  from  two  classes  are  projected 
onto  a  one-dimensional  space  making  for  easy  discrimination  [7].  Another  approach  is  to  utilize  a 
neural network that learns how to classify data [24]. Along the same lines, Support Vector Machines,
borrowing a page from statistical learning theory, develop a classifier by minimizing the training set
error [21]. With the availability of many sensors and many classification techniques, military C-ID
systems  generally  do  not  employ  just  a  single  classifier  system,  but  seek  to  optimize  performance  by 
fusing  the  classification  decisions  of  multiple  systems. 

1.2  Problem  Description 

1.2.1  Background. 

A  classifier  system  in  its  most  elementary  form  consists  of  a  sensor,  a  processor,  and  a  classifier. 
Figure  1.1  depicts  this  notional  classifier  system.  The  sensor  starts  by  collecting  raw  data  on  an 
event it observes. This could be a thermal sensor collecting temperature data or a radar collecting
radio frequency returns. The processor then extracts a feature from this raw data that is salient to
discrimination. This could be a temperature profile or a radar cross section. Finally, the classifier
applies  its  decision  logic  to  the  feature  set,  classifying  the  observed  event.  The  system  depicted  in 
Figure  1.1  is  a  two-state  classifier,  where  the  decision  label  is  either  target  or  non-target. 

Multiple  classifier  systems  (MCSs)  seek  to  increase  the  performance  of  individual  classifiers  by 
intelligently  fusing  their  outputs  [17].  Figure  1.2  depicts  a  notional  MCS.  In  the  depicted  MCS, 
two  single  classifier  systems  each  classify  an  observed  event  and  their  decision  labels  are  fused  using 
some  fusion  rule.  In  general  the  classifiers  are  assumed  to  be  independent,  although  a  significant 
amount  of  research  has  been  done  for  the  case  where  classifiers  are  correlated  [15]. 

There  are  many  advantages  to  MCSs.  For  example,  each  individual  classifier  system  can  focus 
on  a  different  type  of  feature  data,  such  as  length  or  temperature  gradient.  They  can  also  be 
specifically  trained  to  classify  different  target  types;  one  could  specialize  in  identifying  trucks  while 
another  could  specialize  in  tanks.  With  this  in  mind,  it  is  important  that  fusion  strategies  optimize 
the  strengths  and  minimize  the  weaknesses  of  the  individual  classifiers. 

Researchers can compare classifier performance by constructing receiver operating characteristic
(ROC) curves from experimental data. To generate the ROC curve for an MCS would typically
require  additional  experiments  beyond  those  used  to  produce  the  ROCs  for  the  individual  classifiers. 
Oxley  and  Bauer,  however,  demonstrated  that  this  is  not  always  required  and  that  for  certain  fusion 
rules,  the  ROC  curve  for  an  MCS  can  be  determined  solely  from  the  ROC  curves  of  the  individual 
classifiers  [16].  Hill  further  contributed  to  this  area  by  devising  a  matrix-based  approach  for  fusing 
ROC  curves  with  any  Boolean  rule  [13]. 

Figure 1.1. Single classifier system.

Figure 1.2. Multiple classifier system with label fusion.

1.2.2 Problem Statement.


In  real  world  applications,  the  true  ROC  curve  is  seldom,  if  ever,  known.  ROC  curves  must 
be  estimated,  based  on  sample  data.  This  fact  is  generally  disregarded,  and  an  estimated  ROC 
curve  is  treated  as  if  it  were  the  true  ROC  curve  representing  the  true  performance  of  the  classifier. 
Alsing  examined  this  assumption  and  proved  that  estimated  ROC  curves  do  converge  to  the  true 
ROC  curve  in  the  probability  sense  [1].  This  thesis  extends  Alsing’s  convergence  work  to  the  ROC 
curves  of  fused  classifiers. 

1.3  Research  Objectives 

The  goal  of  this  research  is  to  provide  consistency  results  for  the  ROC  curves  of  fused  classifiers. 
The  approach  requires  a  mathematical  framework  to  be  developed  that  can  be  used  to  prove  that 
fused  empirical  ROC  curves  do  converge  to  the  true  fused  ROC  curve.  A  general  class  of  fusion  rules 
will  be  investigated  using  this  framework,  in  addition  to  a  specific  examination  of  the  OR  fusion 
rule.  For  this  research,  independence  of  the  classifiers  is  assumed. 

II. Literature Review


2.1 Overview

This  chapter  reviews  the  literature  pertinent  to  receiver  operating  characteristic  (ROC)  curves 
and  their  use  in  evaluating  classifier  performance.  Section  2.2  provides  a  description  of  ROC  curves, 
demonstrates how their graphical representation enables a quick visual comparison of classifier performance,
and provides a mathematical description of their construction. Section 2.3 discusses the
notion of ROC convergence and introduces the metrics needed to compare ROC curves. Much of
the work done with ROC curves assumes that a limiting or true ROC curve exists, so this is an important
contribution. Finally, Section 2.4 reviews different methods of fusing classifiers and presents
an  important  result  that  allows  two  ROC  curves  to  be  fused  analytically. 

2.2  ROC  Curves 

2.2.1  Description. 

Consider a two-class classifier system that attempts to classify targets of interest (tar class) and
non-targets (non class). There are four possible outcomes when this system attempts to classify
an object. When a tar class object is observed, it can be correctly labeled a tar or incorrectly
labeled a non. Likewise, when a non class object is observed, it can be correctly labeled a non or
incorrectly labeled a tar. Table 2.1 summarizes these outcomes with their associated terminology
and conditional probabilities.


Table 2.1. Classification outcomes.

Outcome            Terminology            Conditional Probability
tar labeled tar    True Positive (TP)     $P_{TP} = \Pr(\text{labeled tar} \mid \text{tar present})$
non labeled tar    False Positive (FP)    $P_{FP} = \Pr(\text{labeled tar} \mid \text{non present})$
tar labeled non    False Negative (FN)    $P_{FN} = \Pr(\text{labeled non} \mid \text{tar present})$
non labeled non    True Negative (TN)     $P_{TN} = \Pr(\text{labeled non} \mid \text{non present})$

The ROC curve provides a visual description of classifier performance by graphically representing
the trade-off between $P_{TP}$ and $P_{FP}$ (Figure 2.1). Commonly, the probability of true positive
is referred to as the hit rate or the probability of detection, while the probability of false positive is
often referred to as the false alarm rate or probability of false alarm (as in the figure). The ROC
curve is constructed by varying the decision thresholds internal to the classifier system and measuring
the observed hit and false alarm rates. A very conservative decision threshold will yield a low
hit rate and low false alarm rate. An aggressive decision threshold will yield a high hit rate, but
generally at the expense of a high false alarm rate.


Figure 2.1. A typical ROC curve. (Horizontal axis: probability of false alarm.)

When considering the total probability for each condition (i.e., the condition when a tar is
present and the condition when a non is present), the following relationships are observed:

\[ P_{TP} + P_{FN} = 1, \]
\[ P_{FP} + P_{TN} = 1. \]

So not only does the ROC curve graphically relate $P_{TP}$ and $P_{FP}$ but, by virtue of their complements,
it provides information about $P_{TN}$ and $P_{FN}$ as well.

2.2.2  ROC  Curve  Construction. 

Consider a simple two-class classifier system, like the one from Section 2.2.1, that evaluates a
single feature $x \in \mathbb{R}$. Figure 2.2 shows the distribution of this feature for both the tar and non
classes. Let X be a real-valued random variable and p(x) represent its probability density function
(pdf). The distribution of x for the tar class is the conditional pdf $p(x \mid tar)$ and the distribution of
x for the non class is the conditional pdf $p(x \mid non)$.


Figure  2.2.  Target  and  non-target  pdfs  for  a  two-class  system. 

In this feature space, lower values of x are considered a stronger indication of a target. So the
decision labels are

\[ \text{tar if } x < \theta, \qquad \text{non if } x > \theta, \]

where $\theta$ is the decision threshold. The horizontal shaded area represents the probability of true
positive at $\theta$, while the vertical shaded area represents the probability of false positive at $\theta$. The
ROC curve is simply the plot of each probability pair $(P_{FP}, P_{TP})$ for all $\theta$ values.
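
The threshold sweep described above can be sketched in a few lines. This is only an illustration under assumed distributions: the source does not specify the class pdfs, so the normal distributions, their means and standard deviations, and the names `normal_cdf` and `roc_point` are all hypothetical. With the rule "tar if x < theta", each probability is the corresponding conditional cumulative distribution function evaluated at the threshold.

```python
import math

def normal_cdf(x, mean, std):
    """Cumulative distribution function of a normal random variable."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))

def roc_point(theta, tar=(0.0, 1.0), non=(2.0, 1.0)):
    """Return (P_FP, P_TP) at threshold theta for the rule 'tar if x < theta'.

    The (mean, std) pairs are hypothetical values chosen for illustration.
    """
    p_tp = normal_cdf(theta, *tar)   # Pr(x < theta | tar present)
    p_fp = normal_cdf(theta, *non)   # Pr(x < theta | non present)
    return p_fp, p_tp

# Sweep the threshold to trace the ROC curve as a list of (P_FP, P_TP) pairs.
thetas = [-4.0 + 0.5 * i for i in range(21)]   # -4.0, -3.5, ..., 6.0
roc = [roc_point(t) for t in thetas]
```

A conservative (low) threshold lands near (0, 0) and an aggressive (high) threshold lands near (1, 1), matching the qualitative description of the trade-off.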

2.2.3 Mathematical Description of ROC Curves.

Since $P_{FP}$ and $P_{TP}$ are functions of $\theta$, the ROC curve is actually the projection of a three-dimensional
trajectory in $(\theta, P_{FP}, P_{TP})$-space onto the two-dimensional $(P_{FP}, P_{TP})$-plane. This
is depicted in Figure 2.3. The three-dimensional trajectory is known as the ROC trajectory. Alsing
defines the ROC trajectory as

\[ t = \{(\theta, P_{FP}(\theta), P_{TP}(\theta)) : \theta \in \Theta\}, \]

where $\Theta$ is the admissible threshold set [1]. The admissible threshold set for the random variable
X is the set $\Theta = (a, b) \subset \mathbb{R}$ such that

\[ \lim_{\theta \to a^+} P_{FP}(\theta) = 0 \quad\text{and}\quad \lim_{\theta \to a^+} P_{TP}(\theta) = 0, \]
\[ \lim_{\theta \to b^-} P_{FP}(\theta) = 1 \quad\text{and}\quad \lim_{\theta \to b^-} P_{TP}(\theta) = 1. \]

The ROC curve f is defined as the projection of t onto the $(P_{FP}, P_{TP})$-plane,

\[ f = \{(P_{FP}(\theta), P_{TP}(\theta)) : \theta \in \Theta\}. \tag{2.1} \]

Some other properties of the ROC curve are that f is a non-decreasing and upper semi-continuous
function of $\theta$.


Figure  2.3.  ROC  trajectory  and  ROC  curve  projection. 


In applications, the probabilities $P_{TP}$ and $P_{FP}$ cannot be determined exactly since only finite
data is available. Hence, the true ROC curve cannot be generated, but must be estimated. These
estimated ROC curves are constructed empirically. There are three methods commonly used to
determine the estimates $\hat{P}_{TP}$ and $\hat{P}_{FP}$:

1.  Assume  the  family  of  the  underlying  distribution  (binomial  or  normal,  for  instance)  [10]  is 
known. 

2.  Estimate  the  unknown  distribution  based  on  the  sample  data  [12]. 

3. Calculate the observed true positive and false positive frequencies for varying $\theta$.

Methods  1  and  2  are  referred  to  as  parametric  methods,  since  they  require  key  parameters  of 
the  statistical  distribution  to  be  known  or  estimated.  Method  3  is  referred  to  as  a  non-parametric 
method,  since  it  does  not  specifically  require  knowledge  of  the  underlying  distribution.  This  last 
method  is  the  one  most  commonly  used  in  practice,  although  this  research  will  investigate  both 
parametric  and  non-parametric  methods. 

The estimated probabilities, $\hat{P}_{TP}$ and $\hat{P}_{FP}$, are random variables since they depend on the data
used to estimate them. Let $\omega$ be the specific instantiation of this data drawn from all possible sets
of observed data $\Omega$ (also known as the uber-event set or population set). The observed data itself
can be represented as a set of feature vectors $x_i$, i = 1 to 2n, where there are n observations from
each class. The classifier evaluates the feature vectors for a specific decision threshold $\theta \in \Theta$.
Therefore, each probability estimate can be expressed as a function of $\omega$ and $\theta$, for n observations:

\[ \hat{P}^{(n)}_{TP}(\omega, \theta) \quad\text{and}\quad \hat{P}^{(n)}_{FP}(\omega, \theta). \]

The estimated, or empirical, ROC curve is then given by

\[ f^{(n)}(\omega) = \{(\hat{P}^{(n)}_{FP}(\omega, \theta), \hat{P}^{(n)}_{TP}(\omega, \theta)) : \theta \in \Theta\}. \]
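
The non-parametric estimate (method 3) can be sketched as follows. The function name `empirical_roc` and the sample values are hypothetical, and the decision rule "tar if x < theta" from Section 2.2.2 is assumed; each estimate is simply the observed fraction of a class's samples that fall below the threshold.

```python
def empirical_roc(tar_samples, non_samples, thetas):
    """Non-parametric ROC estimate for the rule 'tar if x < theta'.

    Returns one (P_FP_hat, P_TP_hat) pair per threshold in thetas.
    """
    n_tar, n_non = len(tar_samples), len(non_samples)
    curve = []
    for theta in thetas:
        # Observed frequency of 'tar' labels within each true class.
        p_tp_hat = sum(x < theta for x in tar_samples) / n_tar
        p_fp_hat = sum(x < theta for x in non_samples) / n_non
        curve.append((p_fp_hat, p_tp_hat))
    return curve

# Hypothetical feature values with n = 4 observations per class.
tar_samples = [0.1, 0.4, 0.9, 1.6]
non_samples = [0.8, 1.5, 2.2, 2.9]
curve = empirical_roc(tar_samples, non_samples, thetas=[0.0, 1.0, 2.0, 3.0])
# curve == [(0.0, 0.0), (0.25, 0.75), (0.5, 1.0), (1.0, 1.0)]
```

A different draw of samples would give a different curve, which is why the estimate is treated as a random variable.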

2.3 ROC Curve Comparison


2.3.1 ROC Metrics.

In the automatic target recognition community, it is a commonly held assumption that in
the case of unlimited data, a limiting ROC curve exists [2]. As n becomes large, $f^{(n)}$ should
converge to this limiting ROC curve f. Alsing provided a proof for ROC convergence and in doing
so developed some very valuable metrics for ROC comparison [1].

Fristedt  and  Gray,  with  a  slight  difference  in  notation,  provide  the  following  definition  of  a 
metric  and  metric  space  [8]. 

Definition 2.1 (Metric Space.) A metric space (S, d) consists of a set S and a function $d : S \times S \to \mathbb{R}$
defined on S that satisfies the following properties:

1. $d(x, y) \ge 0$ for all $x, y \in S$;

2. $d(x, y) = 0$ if and only if $y = x$;

3. $d(x, y) = d(y, x)$ for all $x, y \in S$;

4. $d(x, z) \le d(x, y) + d(y, z)$ for all $x, y, z \in S$.

The function d is called the metric.

As an example, consider the following family of metrics defined on $S = \mathbb{R}^2 = \{x = (x_1, x_2) \mid x_i \in \mathbb{R}\}$,

\[ \rho_q(x, y) = \left(|x_1 - y_1|^q + |x_2 - y_2|^q\right)^{1/q} \tag{2.2} \]

for each $q \in [1, \infty)$. This family of metrics will be used extensively in this research. A function that
adheres to all but property 2 is known as a pseudo-metric and is associated with a pseudo-metric
space. As an example, area under the curve (AUC) is commonly used to compare ROC curves.
The difference in AUC for two curves, f and g, is a pseudo-metric since the difference could equal
zero for some $f \ne g$.

Definition 2.2 (Equivalent Metrics.) Let $d_\alpha$ and $d_\beta$ be two metrics defined on S. The metrics
$d_\alpha$ and $d_\beta$ are equivalent if there exist constants k, K > 0 such that

\[ k\,d_\beta(x, y) \le d_\alpha(x, y) \le K\,d_\beta(x, y) \quad \text{for all } x, y \in S. \]

Furthermore, it can be shown that all metrics on $\mathbb{R}^2$ are equivalent [1].
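
As a worked instance of Definition 2.2, consider the q = 1 and q = 2 members of the family in Equation (2.2). The bound below is a standard fact included for illustration, not a result quoted from [1]; it shows the two metrics are equivalent with constants k = 1 and K = sqrt(2).

```latex
% rho_q(x, y) = (|x_1 - y_1|^q + |x_2 - y_2|^q)^{1/q}, as in Equation (2.2).
% Write a = |x_1 - y_1| and b = |x_2 - y_2|, both nonnegative. Then
%   a^2 + b^2 <= (a + b)^2       gives  rho_2 <= rho_1,
%   (a + b)^2 <= 2 (a^2 + b^2)   gives  rho_1 <= sqrt(2) rho_2,
% where the second line follows from (a - b)^2 >= 0.
\[
  \rho_2(x, y) \;\le\; \rho_1(x, y) \;\le\; \sqrt{2}\,\rho_2(x, y)
  \qquad \text{for all } x, y \in \mathbb{R}^2 .
\]
```

In the notation of the definition this takes $d_\alpha = \rho_1$ and $d_\beta = \rho_2$.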

Consider two ROC curves f and g with the same threshold set $\Theta$ and represented by

\[ f = \{(P^{(f)}_{FP}(\theta), P^{(f)}_{TP}(\theta)) : \theta \in \Theta\} = \{P^{(f)}(\theta) : \theta \in \Theta\}, \]
\[ g = \{(P^{(g)}_{FP}(\theta), P^{(g)}_{TP}(\theta)) : \theta \in \Theta\} = \{P^{(g)}(\theta) : \theta \in \Theta\}. \]

Alsing proposed the following metric on ROC curves for $1 \le r < \infty$, where $\rho$ is any metric on $\mathbb{R}^2$,

\[ d_{\rho,r}(f, g) = \left(\int_\Theta \rho\left(P^{(f)}(\theta), P^{(g)}(\theta)\right)^r d\theta\right)^{1/r}. \tag{2.3} \]

One metric from this family,

\[ d_{\rho_1,1}(f, g) = \int_\Theta \rho_1\left(P^{(f)}(\theta), P^{(g)}(\theta)\right) d\theta, \tag{2.4} \]

provides the total metric distance between f and g. This metric can be normalized to a value between
0 and 1 by dividing by the measure of $\Theta$, $\mu(\Theta)$. This is known as the average metric distance. Like
the difference in AUC pseudo-metric, average metric distance takes on values between 0 and 1. For
both metrics, smaller values indicate similar overall performance between two classifier systems, and
larger values indicate differing overall performance. The important distinction, though, is that an
average metric distance of 0 implies that f and g are the same curve, while, as stated previously, a
difference in AUC of 0 is ambiguous as to the difference. This last fact is why $d_{\rho_1,1}$ is utilized in
proving convergence. For empirical ROC curves, the discrete form is often employed,

\[ \text{average metric distance} \approx \frac{1}{m} \sum_{i=1}^{m} \rho_1\left(P^{(f)}(\theta_i), P^{(g)}(\theta_i)\right), \tag{2.5} \]

where m is the total number of evenly spaced discrete $\theta_i$ values in $\Theta$.
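
The contrast between the two comparisons can be illustrated numerically. The sketch below assumes two hypothetical ROC curves on a shared threshold grid (both invented for this example, with exact area 2/3 each) and evaluates the discrete form of Equation (2.5) with the q = 1 metric alongside a trapezoidal AUC difference. The function names are illustrative only.

```python
import math

def rho_1(u, v):
    """The q = 1 metric on R^2: sum of absolute coordinate differences."""
    return abs(u[0] - v[0]) + abs(u[1] - v[1])

def average_metric_distance(f_points, g_points):
    """Discrete form (2.5): mean of rho_1 over points that share thresholds."""
    assert len(f_points) == len(g_points)
    return sum(rho_1(u, v) for u, v in zip(f_points, g_points)) / len(f_points)

def auc(points):
    """Trapezoidal area under (P_FP, P_TP) points sorted by P_FP."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))

# Two hypothetical curves parameterized by the same thresholds theta_i.
m = 1001
thetas = [i / (m - 1) for i in range(m)]
f_points = [(t, math.sqrt(t)) for t in thetas]
g_points = [(t, 1.0 - (1.0 - t) ** 2) for t in thetas]

auc_difference = abs(auc(f_points) - auc(g_points))        # close to zero
avg_distance = average_metric_distance(f_points, g_points)  # clearly positive
```

The AUC difference is essentially zero even though the curves differ, while the average metric distance stays well away from zero, which is the ambiguity the text attributes to the AUC pseudo-metric.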




2.3.2  ROC  Convergence. 

If empirical ROC curves do not converge to a limiting ROC curve, then there is no guarantee of
consistency in ROC comparisons [1]. To that end, Alsing developed the ROC Convergence Theorem
to prove that empirical ROC curves do, indeed, converge under certain conditions. The most
important condition is that as more feature data is collected, it is collected in such a way that it
adequately fills the feature space.

Let $S \subset \mathbb{R}^\nu$, where $\nu$ is the number of elements in the feature vector x, and S is the total
feature set. Let $\mathcal{D}^{(n)} \subset S$ be the set of n feature vectors collected per class, i.e.,
$\mathcal{D}^{(n)} = \{x_i \in S : i = 1, \ldots, 2n\}$ when there are two classes. To ensure that as more feature data is collected it spans
across the entire feature space and not just a single subset associated with one label, Alsing requires
the sequence of sets, $\{\mathcal{D}^{(n)}\}$, to converge to S in the Hausdorff metric [4]. Under this condition,
the sequence of empirical ROC curves, $\{f^{(n)}\}$, converges to a limiting ROC curve f.

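Theorem 2.3 (ROC Convergence [1].) If $\{\mathcal{D}^{(n)}\}$ converges to S in the Hausdorff sense, i.e.,
given $\epsilon > 0$, there exists H such that for all $n \ge H$, $d_{\mathcal{H}}(\mathcal{D}^{(n)}, S) < \epsilon$, then $\{f^{(n)}(\omega)\}$ converges to f
in probability, i.e., given $\epsilon > 0$, there exists N such that for all $n \ge N$,

\[ \Pr\left(\left\{\omega \in \Omega : d_{\rho,r}\left(f^{(n)}(\omega), f\right) > \epsilon\right\}\right) < \epsilon. \]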

The  proof  is  rather  lengthy,  so  only  an  outline  will  be  provided.  The  proof  involves  four  steps: 

1. Prove that $\hat{P}^{(n)}_{TP}(\omega, \theta)$ is a consistent estimator for $P_{TP}(\theta)$ and that $\hat{P}^{(n)}_{FP}(\omega, \theta)$ is a consistent
estimator for $P_{FP}(\theta)$.

2. Prove pointwise convergence for the estimated probability pair $(\hat{P}^{(n)}_{FP}(\omega, \theta), \hat{P}^{(n)}_{TP}(\omega, \theta))$.

3. Prove that the integral of a real-valued random variable converges.

4. Prove that the sequence of ROC curves $\{f^{(n)}(\omega)\}$ converges.
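
The behavior the theorem describes can be observed in a small simulation. This is a sketch under assumptions that are not in the source: normally distributed features with hypothetical parameters, the rule "tar if x < theta", a finite threshold grid, and the discrete average of the q = 1 metric as the distance. It illustrates the statement and proves nothing.

```python
import bisect
import math
import random

def normal_cdf(x, mean, std):
    """Cumulative distribution function of a normal random variable."""
    return 0.5 * (1.0 + math.erf((x - mean) / (std * math.sqrt(2.0))))

def empirical_cdf(sorted_samples, theta):
    """Fraction of samples strictly below theta."""
    return bisect.bisect_left(sorted_samples, theta) / len(sorted_samples)

def distance_to_true_roc(n, rng, thetas, tar=(0.0, 1.0), non=(2.0, 1.0)):
    """Average rho_1 distance between an empirical ROC (n samples per class)
    and the true ROC of the assumed normal model."""
    tar_x = sorted(rng.gauss(*tar) for _ in range(n))
    non_x = sorted(rng.gauss(*non) for _ in range(n))
    total = 0.0
    for theta in thetas:
        total += abs(empirical_cdf(non_x, theta) - normal_cdf(theta, *non))  # P_FP error
        total += abs(empirical_cdf(tar_x, theta) - normal_cdf(theta, *tar))  # P_TP error
    return total / len(thetas)

rng = random.Random(0)                           # fixed seed for repeatability
thetas = [-4.0 + 0.05 * i for i in range(201)]   # grid over [-4, 6]
distances = {n: distance_to_true_roc(n, rng, thetas) for n in (50, 500, 5000)}
```

One would expect the distance to shrink as n grows, although any single random draw can deviate from that trend.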

2.4 Classifier and ROC Fusion

2.4.1 Classifier Fusion Rules.

There is often a performance advantage to be gained by fusing the results of individual classifiers.
Consider Bayesian classifiers that estimate the posterior probability of a particular observation
belonging to a certain class. Since this estimate is based on observed data, there is inherent
sample variance associated with these estimates. When these estimates are averaged over several
Bayesian classifiers, this variance is reduced [18]. Also, certain classifiers may have better performance
against specific targets or in certain situations [20]. In this case, a fusion strategy could be
devised that weights a specific classifier’s decision more heavily under certain conditions.

According  to  Dasarathy,  classifier  decision  fusion  is  simply  a  subset  of  sensor  fusion  [6].  He 
notes  that  fusion  can  be  accomplished  at  the  data,  feature,  or  decision  level  with  the  following  inputs 
and  outputs: 

1. Data in → Data out,

2. Data in → Feature out,

3. Feature in → Feature out,

4. Feature in → Decision out,

5. Decision in → Decision out.

Thorsen describes data fusion using category theory [19]. In category theory, objects are
represented by boxes and mappings are represented by arrows. Referring to Figure 1.2, data, feature,
and label fusion are then fusions of objects, while sensor, processor, and classifier fusion are fusions
of mappings. It is interesting to note that, with the exception of sensor fusion, Dasarathy’s and
Thorsen’s fusion categories parallel each other (Table 2.2). Sensor fusion would be the equivalent
of “Event in → Data out” fusion, which Dasarathy does not categorize.


Table 2.2. Fusion categories.

Dasarathy’s I/O Category    Thorsen’s Object/Mapping Fusion
Data/Data                   Data Fusion
Data/Feature                Processor Fusion
Feature/Feature             Feature Fusion
Feature/Decision            Classifier Fusion
Decision/Decision           Label Fusion

Decision, or label, fusion is the focus of this research. There are many advantages to this type
of fusion. When the fuser considers only discrete decision values, there is no concern about the
compatibility of the data or the specific architecture used to design the individual classifiers. A practical
consideration may also be the amount of bandwidth or processing capability the fusion center has
[22]. Once again, a decision fuser is not burdened by having to transmit or process a large amount
of raw or feature data.

The simplest method for fusing data is to use Boolean rules. Consider the case where there are
two classifiers, A and B. A fusion approach that wants to maximize the number of true positives,
at the possible expense of an increased false positive rate, will make a target declaration if either A
or B indicates a target. This is known as the OR fusion rule and will be denoted as $A \vee B$. A more
conservative approach that tries to maintain a low false positive rate will make a target declaration
only if both A and B indicate a target. This is known as the AND fusion rule and will be denoted
as $A \wedge B$. Fusion can be extended to include a third classifier C. In this case, the majority vote
fusion rule can be used. The majority vote, in the three-classifier case, requires at least two of the three
individual classifiers to agree. The formula for this rule is $(A \wedge B) \vee (A \wedge C) \vee (B \wedge C)$. Table 2.3
summarizes these fusion rules. Of course, these rules can be extended to any number of classifiers.


Table 2.3. Boolean fusion rules.

Fusion Rule                   Notation
2-classifier OR               $A \vee B$
2-classifier AND              $A \wedge B$
3-classifier Majority Vote    $(A \wedge B) \vee (A \wedge C) \vee (B \wedge C)$
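
The three Boolean rules in Table 2.3 translate directly into code. In this minimal sketch, `True` stands for a target declaration and `False` for a non-target declaration; the function names are illustrative.

```python
def fuse_or(a, b):
    """OR rule: declare a target if either classifier does."""
    return a or b

def fuse_and(a, b):
    """AND rule: declare a target only if both classifiers do."""
    return a and b

def fuse_majority(a, b, c):
    """Three-classifier majority vote: (A and B) or (A and C) or (B and C)."""
    return (a and b) or (a and c) or (b and c)

# One classifier declares a target, the other does not.
or_label = fuse_or(True, False)                    # True: target declared
and_label = fuse_and(True, False)                  # False: no target declared
vote_label = fuse_majority(True, False, True)      # True: two of three agree
```

The OR rule favors detections at the cost of more false alarms, while the AND rule does the opposite.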

2.4.2 Within and Across Fusion.

Hill discusses two types of fusion on MCSs: across and within [13]. Figures 2.4 and 2.5 show
the block diagrams that Hill developed for both ROC fusion types. In the within-MCS, the event
set E is partitioned into two event subsets: targets of interest ($E_{tar}$) and non-targets ($E_{non}$). Each
individual classifier system seeks to classify the same type of target. The decision from each classifier
is then sent to the fusion center. In the across-MCS, the event set E is partitioned into three subsets:
target type 1 ($E_{tar,1}$), target type 2 ($E_{tar,2}$), and non-targets ($E_{non}$). Each individual classifier
system seeks different types of targets (e.g., one is trained to detect tanks, while another is trained
to detect troop carriers). The decision from each classifier is sent to the fusion center. Note that
for ease of notation, the processor is assumed to be part of the sensor in these MCS representations.

Figure 2.4. Within-MCS. (Block diagram stages: sensors, classifiers, fusion.)

Figure 2.5. Across-MCS. (Block diagram stages: sensors, classifiers, fusion; feature sets X and individual label sets L.)

For both types of MCSs, a compact notation can be defined. Figure 2.6 shows the block
diagram. Classifier $A_\theta$ maps feature set X to label set L and classifier $B_\phi$ maps feature set Y to
label set M. More formally, $A_\theta : X \to L$ for $\theta \in \Theta$, where $\Theta$ is the admissible threshold set for $A_\theta$,
and $B_\phi : Y \to M$ for $\phi \in \Phi$, where $\Phi$ is the admissible threshold set for $B_\phi$. The fused classifier
$C_{\theta,\phi}$ is the result of the fusion rule r acting on the labels, or

\[ C_{\theta,\phi}(x, y) = r(A_\theta(x), B_\phi(y)), \]

where $(x, y) \in X \times Y$ and $(\theta, \phi) \in \Theta \times \Phi$. This can also be stated in terms of the label fusion as
$r : L \times M \to N$.

One final item on notation is presented. When referring to the families of classifiers for $A_\theta$,
$B_\phi$, and $C_{\theta,\phi}$, the following notation is used:

\[ A = \{A_\theta : \theta \in \Theta\}, \]
\[ B = \{B_\phi : \phi \in \Phi\}, \text{ and} \]
\[ C = \{C_{\theta,\phi} : \theta \in \Theta, \phi \in \Phi\}. \]

This is relevant when referring to the ROC curves for these classifiers, since the ROC curve represents
the performance of the entire classifier family, not just the classifier for a given threshold.


Figure 2.6. Compact notation for MCS.

2.4.3 Independence.

When analyzing multiple classifier systems (MCSs) it is often assumed that the classifiers are
statistically independent [14]. More accurately, it is assumed that the measurements of the classifiers
are independent. Two events, $E_1$ and $E_2$, are statistically independent if the following condition is
met,

\[ \Pr(E_1 \cap E_2) = \Pr(E_1) \cdot \Pr(E_2). \]

To extend this idea to the independence of classifiers, consider classifiers $A, B : X \to L$. Classifiers
A and B are said to be independent if for every pair of label subsets $L_1, L_2 \subset L$,

\[ \Pr(A^{-1}[L_1] \cap B^{-1}[L_2]) = \Pr(A^{-1}[L_1]) \cdot \Pr(B^{-1}[L_2]), \]

where $A^{-1}[L_1]$ and $B^{-1}[L_2]$ are the inverse images of $L_1$ and $L_2$ under the classifiers. That is,

\[ A^{-1}[L_1] = \{x \in X : A(x) \in L_1\} \quad\text{and}\quad B^{-1}[L_2] = \{x \in X : B(x) \in L_2\}. \]

With this assumption, analysts can more easily compute the joint probabilities of an MCS by
not having to consider the correlation between the classifiers. Any in-depth analysis of a specific
MCS should examine whether this assumption is valid.
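
For a finite label set, the independence condition can be checked directly from a table of joint label probabilities. The sketch below is an illustration only: the function name `are_independent`, the dictionary layout, and the numbers are hypothetical. It tests the product condition on single labels, which is sufficient for a finite label set because the probabilities of larger label subsets are sums of these terms.

```python
from itertools import product

def are_independent(joint, tol=1e-12):
    """Check Pr(A = a, B = b) == Pr(A = a) * Pr(B = b) for every label pair.

    joint maps (label_from_A, label_from_B) to a joint probability.
    """
    labels_a = {a for a, _ in joint}
    labels_b = {b for _, b in joint}
    # Marginal label probabilities for each classifier.
    pr_a = {a: sum(joint.get((a, b), 0.0) for b in labels_b) for a in labels_a}
    pr_b = {b: sum(joint.get((a, b), 0.0) for a in labels_a) for b in labels_b}
    return all(abs(joint.get((a, b), 0.0) - pr_a[a] * pr_b[b]) <= tol
               for a, b in product(labels_a, labels_b))

# Hypothetical joint distributions over the labels of classifiers A and B.
independent_joint = {("tar", "tar"): 0.18, ("tar", "non"): 0.12,
                     ("non", "tar"): 0.42, ("non", "non"): 0.28}
correlated_joint = {("tar", "tar"): 0.5, ("non", "non"): 0.5}

independent_result = are_independent(independent_joint)   # True
correlated_result = are_independent(correlated_joint)     # False
```

In practice the joint probabilities would themselves be estimates, so a looser tolerance or a statistical test would be needed.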

In some situations it is appropriate to assume that two classifiers are conditionally independent.
Two events, $E_1$ and $E_2$, are said to be conditionally independent if they are independent with respect
to the same condition G. In other words,

\[ \Pr(E_1 \cap E_2 \mid G) = \Pr(E_1 \mid G) \cdot \Pr(E_2 \mid G). \]

Similar to statistical independence, this can be very helpful when calculating joint conditional probabilities,
but once again, the validity of this assumption should be examined for any specific analysis.

Some researchers are less concerned with the classifiers being independent and more concerned
with the classifiers making independent errors. For classifiers to make independent errors, they
must satisfy the following conditions,

\[ P_{FP}(A^{-1}[L_{tar}] \cap B^{-1}[L_{tar}] \mid x \in X_{non}) = P_{FP}(A^{-1}[L_{tar}] \mid x \in X_{non}) \cdot P_{FP}(B^{-1}[L_{tar}] \mid x \in X_{non}), \text{ and} \]
\[ P_{FN}(A^{-1}[L_{non}] \cap B^{-1}[L_{non}] \mid x \in X_{tar}) = P_{FN}(A^{-1}[L_{non}] \mid x \in X_{tar}) \cdot P_{FN}(B^{-1}[L_{non}] \mid x \in X_{tar}). \]

A number of researchers have used this assumption [9], [11]. Kuncheva contends, however, that
negatively correlated errors can actually improve the performance of an MCS [15].

2.4.4 ROC Fusion.

Oxley and Bauer demonstrated that one can analytically construct a ROC curve for an MCS
simply by using the ROC curves of the individual classifiers [16]. They begin by providing the
following definition for ROC curves. The ROC curve $f_A$ is defined as

\[ f_A(p) = \max\{P_{TP}(A_\theta) : \theta \in \Theta \text{ and } P_{FP}(A_\theta) = p\} \tag{2.6} \]

for each $p \in [0, 1]$. In other words, if there are multiple $\theta$ values for which $P_{FP}(A_\theta) = p$, then the
highest associated $P_{TP}(A_\theta)$ value is chosen to construct the ROC curve.

Similarly, the ROC curve $f_B$ is defined as

\[ f_B(q) = \max\{P_{TP}(B_\phi) : \phi \in \Phi \text{ and } P_{FP}(B_\phi) = q\} \tag{2.7} \]

for each $q \in [0, 1]$.

Finally, the ROC curve $f_C$ is defined to be

\[ f_C(r) = \max\{P_{TP}(C_{\theta,\phi}) : (\theta, \phi) \in \Theta \times \Phi \text{ and } P_{FP}(C_{\theta,\phi}) = r\} \tag{2.8} \]

for each $r \in [0, 1]$.

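For the across-OR, Oxley and Bauer showed that

\[ P_{FP}(C_{\theta,\phi}) = P_{FP}(A_\theta) + P_{FP}(B_\phi) - P_{FP}(A_\theta) P_{FP}(B_\phi), \tag{2.9} \]

and

\[
\begin{aligned}
P_{TP}(C_{\theta,\phi}) ={}& \frac{\beta(1-\alpha)}{\alpha+\beta-\alpha\beta} P_{FP}(A_\theta)
 + \frac{\alpha}{\alpha+\beta-\alpha\beta} P_{TP}(A_\theta)
 + \frac{\alpha(1-\beta)}{\alpha+\beta-\alpha\beta} P_{FP}(B_\phi) \\
&+ \frac{\beta}{\alpha+\beta-\alpha\beta} P_{TP}(B_\phi)
 - \frac{\beta(1-\alpha)}{\alpha+\beta-\alpha\beta} P_{FP}(A_\theta) P_{TP}(B_\phi) \\
&- \frac{\alpha(1-\beta)}{\alpha+\beta-\alpha\beta} P_{TP}(A_\theta) P_{FP}(B_\phi)
 - \frac{\alpha\beta}{\alpha+\beta-\alpha\beta} P_{TP}(A_\theta) P_{TP}(B_\phi),
\end{aligned}
\]

where $\alpha = \Pr(X_{tar})$ and $\beta = \Pr(Y_{tar})$ are the a priori probabilities.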

Considering Equation (2.9) and using the notation from the ROC curve definitions, it is seen that

$$r = p + q - pq$$

for a choice of $p$ and $q$. For a given $r \in [0, 1]$, then, $p$ and $q$ are constrained such that if $q \in [0, 1]$,
then $p \in [0, r]$. In particular, for each $r$,

$$q = Q(p, r) = \begin{cases} \dfrac{r - p}{1 - p} & \text{for } 0 \le p \le r \text{ when } r < 1, \\[1ex] 1 & \text{for } p = r \text{ when } r = 1. \end{cases}$$

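Using this relationship and Equation (2.8), it was derived in [16] that the formula for the fused
across-OR ROC curve is

$$f_C(r) = \frac{1 - (1 - \gamma) r}{\gamma} - \frac{1}{\gamma} \min_{0 \le p \le r} \left[1 - \big(\alpha f_A(p) + (1 - \alpha) p\big)\right] \left[1 - \left(\beta f_B\!\left(\frac{r - p}{1 - p}\right) + (1 - \beta) \frac{r - p}{1 - p}\right)\right], \tag{2.10}$$

where $\gamma = \alpha + \beta - \alpha\beta$.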

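In similar fashion, it was derived in [16] that the fused within-OR ROC curve is

$$f_C(r) = \max_{0 \le p \le r} \left[ f_A(p) + f_B\!\left(\frac{r - p}{1 - p}\right) - f_A(p)\, f_B\!\left(\frac{r - p}{1 - p}\right) \right]. \tag{2.11}$$

Again, both of these formulas assume that the classifiers $A_\theta$ and $B_\phi$ are independent.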
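The two fusion formulas can be approximated numerically by sweeping $p$ over a grid on $[0, r]$. The sketch below is an illustration only and is not Oxley and Bauer's code: the function names, the grid size `steps`, and the clamping of $q$ to $[0, 1]$ are assumptions. The across-OR version uses the independence form in which the fused miss probability is the product of the individual miss probabilities, with priors `alpha` and `beta`.

```python
def q_of(p, r):
    """q solving r = p + q - p*q, clamped to [0, 1]; taken as 1 when p = 1."""
    if p >= 1.0:
        return 1.0
    return min(1.0, max(0.0, (r - p) / (1.0 - p)))

def within_or(f_a, f_b, r, steps=200):
    """Grid approximation of the within-OR fused ROC value at r."""
    best = 0.0
    for k in range(steps + 1):
        p = r * k / steps                      # p sweeps [0, r]
        a, b = f_a(p), f_b(q_of(p, r))
        best = max(best, a + b - a * b)
    return best

def across_or(f_a, f_b, r, alpha, beta, steps=200):
    """Grid approximation of the across-OR fused ROC value at r."""
    gamma = alpha + beta - alpha * beta
    smallest = float("inf")
    for k in range(steps + 1):
        p = r * k / steps
        q = q_of(p, r)
        miss_a = 1.0 - (alpha * f_a(p) + (1.0 - alpha) * p)
        miss_b = 1.0 - (beta * f_b(q) + (1.0 - beta) * q)
        smallest = min(smallest, miss_a * miss_b)
    return (1.0 - (1.0 - gamma) * r - smallest) / gamma

chance = lambda p: p   # chance-line ROC curve, used only as a sanity check
```

As a sanity check, fusing two chance-line classifiers returns the chance line again under both rules, since $p + q - pq = r$ for every admissible $p$.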
III. Derivation and Methodology


3.1 Introduction

This  chapter  extends  Alsing’s  ROC  convergence  theorem  by  demonstrating  that  convergence 
is  preserved  when  two  empirical  ROC  curves  are  fused.  Section  3.2  introduces  the  framework  and 
notation  that  will  be  used  in  demonstrating  this  convergence.  Section  3.3  provides  the  general 
result  that  convergence  of  a  fused  ROC  curve  is  guaranteed  when  the  ROC  fusion  is  accomplished 
by  a  Lipschitz  continuous  transformation.  Section  3.4  applies  this  result  to  the  within-OR  and 
across-OR  fusion. 

3.2  Framework  and  Notation 

A classifier fusion rule, $r$, establishes how classifiers $A$ and $B$ are combined into a fused classifier
$C$, so that $C = r(A, B)$. As Oxley and Bauer demonstrated, for certain fusion rules there may
be a mapping or transformation, $T$, that relates the fused ROC curve, $f_C$, to the ROC curves of
the individual classifier families, $f_A$ and $f_B$, so that $f_C = f_{r(A,B)} = T(f_A, f_B)$. Furthermore, this
transformation may have a function $g$ associated with it that allows for a pointwise evaluation of the
fused ROC curve. Considering the individual ROC curves as functions of $p \in [0, 1]$, the following
relations hold true

$$f_C(p) = f_{r(A,B)}(p) = T(f_A, f_B)(p) = g(f_A(p), f_B(p)).$$

Thus, the transformation $T$ is a substitution transformation.

3.3  Convergence  for  Continuous  Substitution  Transformation 

This  section  looks  at  a  class  of  substitution  transformations,  S,  that  satisfy  the  Lipschitz 
condition.  If  a  particular  fusion  rule,  r,  yields  a  Lipschitz  continuous  substitution  transformation, 
S,  then  convergence  of  the  fused  ROC  curve  for  that  rule  is  guaranteed  by  this  result.  A  number 
of  definitions  and  theorems  will  be  required  to  establish  the  framework  to  prove  this  result. 

The ROC curve $f$ has been discussed in some detail to this point, but a few more remarks
should be made before proceeding with this analysis. The ROC curve $f$ has been referred to
primarily as a relation. Apostol defines a relation as any set of ordered pairs [3]. This is consistent
with Equation (2.1). For this analysis, however, it will be more useful to think of $f$ as a function,
such that for $(x, y) \in f$, there exists a unique value $y = f(x)$ for a given $x$. With this in mind, the
following ROC curve definition is stated.

Definition 3.1 (ROC Curve.) The function $f : [0, 1] \to [0, 1]$ is said to be a ROC curve if the
domain of $f$ is $[0, 1]$ and $f$ is non-decreasing and upper semi-continuous. Let $\mathfrak{R}$ denote the collection
of all ROC curves, that is

$$\mathfrak{R} = \{f : [0, 1] \to [0, 1] \mid f \text{ is non-decreasing and upper semi-continuous}\}.$$

For $f, g \in \mathfrak{R}$, the following mapping is introduced,

$$d_{\rho_1,1}(f, g) = \int_0^1 \rho_1(f(p), g(p))\, dp, \tag{3.1}$$

where $\rho_1$ is the standard metric on $\mathbb{R}$. Note this is the same mapping defined by Equation (2.4),
but is generalized from $\theta$-space to $p$-space. If $P_{FP}^{(n)}(\theta) = P_{FP}(\theta)$ for all $\theta \in \Theta$, then Equation
(2.4) and Equation (3.1) are equivalent. This would seem to imply that Equation (3.1) is a special
case of Equation (2.4); however, it actually has more general applicability. Since the $d_{\rho_1,1}$ mapping
from Equation (3.1) is defined over the projected ROC curves as a function of $p$, it does not require
specific knowledge of the $\theta \in \Theta$ used to construct the ROC curves. This is very important when
the threshold sets for $A$ and $B$ are not equal (i.e., $\Phi \ne \Theta$). The following theorem proves that the
$d_{\rho_1,1}$ mapping is a metric on $\mathfrak{R}$.

Theorem 3.2 $(\mathfrak{R}, d_{\rho_1,1})$ is a metric space.

Proof. For $f_A, f_B \in \mathfrak{R}$, it must be shown that $d_{\rho_1,1}(f_A, f_B)$ exists and satisfies the four required
properties of a metric (Definition 2.1). Since $f_A, f_B \in \mathfrak{R}$, then $f_A - f_B$ has bounded variation. Hence
$|f_A - f_B|$ has bounded variation and is therefore Riemann integrable. Thus,

$$d_{\rho_1,1}(f_A, f_B) = \int_0^1 |f_A(p) - f_B(p)|\, dp$$

exists for every $f_A, f_B \in \mathfrak{R}$.

1. Nonnegativity. Since $|f_A(p) - f_B(p)| \ge 0$ for all $p \in [0, 1]$, it follows that $\int_0^1 \rho_1(f_A(p), f_B(p))\, dp \ge 0$. Therefore
$d_{\rho_1,1}(f_A, f_B) \ge 0$ for all $f_A, f_B$ and so $d_{\rho_1,1}$ satisfies nonnegativity.

2. Definiteness. Assume $d_{\rho_1,1}(f_A, f_B) = 0$. Then $\int_0^1 |f_A(p) - f_B(p)|\, dp = 0$. This implies
$|f_A(p) - f_B(p)| = 0$ for almost every $p \in [0, 1]$, but $|f_A(p) - f_B(p)|$ of bounded variation on
$[0, 1]$ implies there are at most countably many discontinuities. Since $f_A, f_B$ are upper semi-continuous,
then $f_A = f_B$.

Now assume $f_A = f_B$; then $|f_A(p) - f_B(p)| = 0$ for all $p \in [0, 1]$, which implies $\int_0^1 |f_A(p) - f_B(p)|\, dp = 0$.
So, $d_{\rho_1,1}(f_A, f_B) = 0$. Therefore $d_{\rho_1,1}$ satisfies the definiteness property.

3. Symmetry. Since $\rho_1$ is a metric on $\mathbb{R}$, then $\rho_1(x, y) = \rho_1(y, x)$ for all $x, y \in \mathbb{R}$. This implies
$\rho_1(f_A(p), f_B(p)) = \rho_1(f_B(p), f_A(p))$ for all $p \in [0, 1]$; hence,

$$d_{\rho_1,1}(f_A, f_B) = \int_0^1 \rho_1(f_A(p), f_B(p))\, dp = \int_0^1 \rho_1(f_B(p), f_A(p))\, dp = d_{\rho_1,1}(f_B, f_A).$$

Therefore $d_{\rho_1,1}$ satisfies the symmetry property.

4. Triangle inequality. Let $f_A, f_B, f_D \in \mathfrak{R}$. Since $\rho_1$ is a metric on $\mathbb{R}$, then $\rho_1(x, y) \le \rho_1(x, z) + \rho_1(z, y)$
for all $x, y, z \in \mathbb{R}$. Therefore $\rho_1(f_A(p), f_B(p)) \le \rho_1(f_A(p), f_D(p)) + \rho_1(f_D(p), f_B(p))$ for
all $p \in [0, 1]$; hence,

$$\begin{aligned}
d_{\rho_1,1}(f_A, f_B) &= \int_0^1 \rho_1(f_A(p), f_B(p))\, dp \\
&\le \int_0^1 \rho_1(f_A(p), f_D(p))\, dp + \int_0^1 \rho_1(f_D(p), f_B(p))\, dp \\
&= d_{\rho_1,1}(f_A, f_D) + d_{\rho_1,1}(f_D, f_B).
\end{aligned}$$

So $d_{\rho_1,1}$ satisfies the triangle inequality property.


Definition 3.3 (Average metric distance.) For $f_A, f_B \in \mathfrak{R}$, the mapping

$$d_{\rho_1,1}(f_A, f_B) = \int_0^1 \rho_1(f_A(p), f_B(p))\, dp \tag{3.2}$$

is defined to be the average metric distance.

Note that when this metric was defined over $\theta$ (see Equation (2.4)), $d_{\rho_1,1}(f_A, f_B)$ was referred
to as the total metric distance. Since the measure of the interval $[0, 1]$ is 1, then for Equation (3.2)
the total metric distance and average metric distance are equivalent, and the $d_{\rho_1,1}(f_A, f_B)$ notation
will be used to refer to both.
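The average metric distance can be approximated with a simple quadrature rule. The following is a minimal sketch assuming a midpoint rule with `m` cells; the function name and the example curves are hypothetical and chosen only because the exact integral is known.

```python
def average_metric_distance(f, g, m=1000):
    """Midpoint-rule approximation of the integral of |f(p) - g(p)| over [0, 1]."""
    total = 0.0
    for i in range(m):
        p = (i + 0.5) / m          # midpoint of the i-th cell
        total += abs(f(p) - g(p))
    return total / m

chance = lambda p: p               # chance line
curve = lambda p: p ** 0.5         # hypothetical concave ROC curve
d = average_metric_distance(curve, chance)
```

For these two curves the exact value is $\int_0^1 (\sqrt{p} - p)\, dp = 2/3 - 1/2 = 1/6$, which the approximation reproduces to within the quadrature error.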

Two further metrics will also be used. The first is the Manhattan metric on a 2-element vector.
For $S = [0, 1]^2 = [0, 1] \times [0, 1] = \{x = (\xi, \eta) \in \mathbb{R}^2 \mid 0 \le \xi \le 1,\ 0 \le \eta \le 1\}$, the following metric is
defined

$$\rho_1^{(2)}(x_1, x_2) = |\xi_1 - \xi_2| + |\eta_1 - \eta_2|.$$

The metric space $([0, 1]^2, \rho_1^{(2)})$ is commonly used and is stated without proof. Similarly for $S =
\mathfrak{R}^2 = \mathfrak{R} \times \mathfrak{R}$, the following metric is defined

$$d^{(2)}_{\rho_1,1}[(f_A, f_B), (f_{A'}, f_{B'})] = d_{\rho_1,1}(f_A, f_{A'}) + d_{\rho_1,1}(f_B, f_{B'}).$$

The proof that $(\mathfrak{R}^2, d^{(2)}_{\rho_1,1})$ is a metric space is a simple extension of Theorem 3.2. These metrics
will be used in conjunction with the following definitions.


Definition 3.4 (Lipschitz continuous mapping.) Let $(S, d_S)$ and $(T, d_T)$ be metric spaces.
The mapping $\sigma : S \to T$ is said to be a Lipschitz continuous mapping if there exists some $L > 0$
such that

$$d_T(\sigma(s_1), \sigma(s_2)) \le L\, d_S(s_1, s_2) \quad \text{for all } s_1, s_2 \in S.$$

$\mathcal{L}ip(S, T)$ is used to denote the collection of all Lipschitz continuous mappings from $(S, d_S)$ into
$(T, d_T)$.


Definition 3.5 (ROC fusion function.) The function $g : [0, 1]^2 \to [0, 1]$ is said to be a ROC
fusion function if the domain of $g$ is $[0, 1] \times [0, 1]$ and $g$ is non-decreasing in both variables. Let $\mathcal{G}$
denote the collection of all ROC fusion functions, that is

$$\mathcal{G} = \{g : [0, 1]^2 \to [0, 1] \mid g \text{ is non-decreasing in both variables}\}.$$


Notice that if there exists an $L > 0$ such that

$$\rho_1(g(x_1), g(x_2)) \le L\, \rho_1^{(2)}(x_1, x_2)$$

for all $x_1, x_2 \in [0, 1]^2$, then $g : [0, 1]^2 \to [0, 1]$ is a Lipschitz continuous mapping from $([0, 1]^2, \rho_1^{(2)})$
into $([0, 1], \rho_1)$, or $g \in \mathcal{L}ip([0, 1]^2, [0, 1])$.
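A Lipschitz condition of this kind can be probed numerically by sampling pairs of points and recording the largest ratio of output distance to input distance. The sketch below is an empirical check, not a proof; it uses the OR-type function $g(\xi, \eta) = \xi + \eta - \xi\eta$ that appears in Section 3.4, and the helper names, trial count, and seed are assumptions.

```python
import random

def g_or(xi, eta):
    """OR-type ROC fusion function g(xi, eta) = xi + eta - xi*eta."""
    return xi + eta - xi * eta

def manhattan(x1, x2):
    """Manhattan metric on [0, 1]^2."""
    return abs(x1[0] - x2[0]) + abs(x1[1] - x2[1])

def worst_ratio(g, trials=10000, seed=0):
    """Largest observed |g(x1) - g(x2)| / manhattan(x1, x2) over random pairs."""
    rng = random.Random(seed)
    worst = 0.0
    for _ in range(trials):
        x1 = (rng.random(), rng.random())
        x2 = (rng.random(), rng.random())
        dist = manhattan(x1, x2)
        if dist > 0.0:
            worst = max(worst, abs(g(*x1) - g(*x2)) / dist)
    return worst

ratio = worst_ratio(g_or)
```

An observed ratio that never exceeds 1 is consistent with a Lipschitz constant of $L = 1$ for this particular $g$, although sampling alone cannot establish the bound.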

Definition 3.6 (Substitution transformation.) For $g : [0, 1]^2 \to [0, 1]$, the substitution transformation,
$\mathbb{S}_g : \mathfrak{R}^2 \to \mathfrak{R}$, is defined as

$$\mathbb{S}_g(f_A, f_B)(p) = g(f_A(p), f_B(p)) \quad \text{for all } p \in [0, 1].$$

Notice that if there exists an $L > 0$ such that

$$d_{\rho_1,1}[\mathbb{S}_g(f_A, f_B), \mathbb{S}_g(f_{A'}, f_{B'})] \le L\, d^{(2)}_{\rho_1,1}[(f_A, f_B), (f_{A'}, f_{B'})]$$

for all $(f_A, f_B), (f_{A'}, f_{B'}) \in \mathfrak{R}^2$, then $\mathbb{S}_g : \mathfrak{R}^2 \to \mathfrak{R}$ is a Lipschitz continuous mapping from $(\mathfrak{R}^2, d^{(2)}_{\rho_1,1})$
into $(\mathfrak{R}, d_{\rho_1,1})$, or $\mathbb{S}_g \in \mathcal{L}ip(\mathfrak{R}^2, \mathfrak{R})$.

With this framework developed, the main result can now be proven. The approach will be to
demonstrate  that  if  a  ROC  fusion  function  is  Lipschitz  continuous,  then  the  associated  substitution 
transformation  is  also  Lipschitz  continuous.  When  this  Lipschitz  substitution  transformation  is 
used  to  fuse  two  individual  ROC  curves  that  converge  in  probability,  it  is  proven  that  the  fused 
ROC  curve  will  converge  in  probability  as  well. 

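Theorem 3.7 Let $g \in \mathcal{L}ip([0, 1]^2, [0, 1]) \cap \mathcal{G}$; then $\mathbb{S}_g \in \mathcal{L}ip(\mathfrak{R}^2, \mathfrak{R})$.

Proof. Let $(f_A, f_B), (f_{A'}, f_{B'}) \in \mathfrak{R}^2$ and

$$d_{\rho_1,1}[\mathbb{S}_g(f_A, f_B), \mathbb{S}_g(f_{A'}, f_{B'})] = \int_0^1 |\mathbb{S}_g(f_A, f_B)(p) - \mathbb{S}_g(f_{A'}, f_{B'})(p)|\, dp.$$

Show that the Lipschitz condition is met.

$$\begin{aligned}
d_{\rho_1,1}[\mathbb{S}_g(f_A, f_B), \mathbb{S}_g(f_{A'}, f_{B'})]
&= \int_0^1 |\mathbb{S}_g(f_A, f_B)(p) - \mathbb{S}_g(f_{A'}, f_{B'})(p)|\, dp \\
&= \int_0^1 |g(f_A(p), f_B(p)) - g(f_{A'}(p), f_{B'}(p))|\, dp \\
&\le \int_0^1 L\big(|f_A(p) - f_{A'}(p)| + |f_B(p) - f_{B'}(p)|\big)\, dp \\
&= L \int_0^1 |f_A(p) - f_{A'}(p)|\, dp + L \int_0^1 |f_B(p) - f_{B'}(p)|\, dp \\
&= L\big[d_{\rho_1,1}(f_A, f_{A'}) + d_{\rho_1,1}(f_B, f_{B'})\big] \\
&= L\, d^{(2)}_{\rho_1,1}[(f_A, f_B), (f_{A'}, f_{B'})].
\end{aligned}$$

So as desired, the Lipschitz condition on $\mathbb{S}_g$ is met,

$$d_{\rho_1,1}[\mathbb{S}_g(f_A, f_B), \mathbb{S}_g(f_{A'}, f_{B'})] \le L\, d^{(2)}_{\rho_1,1}[(f_A, f_B), (f_{A'}, f_{B'})],$$

and $\mathbb{S}_g \in \mathcal{L}ip(\mathfrak{R}^2, \mathfrak{R})$.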
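The inequality established in Theorem 3.7 can be illustrated numerically for one choice of $g$ and two pairs of curves. The sketch below is only a spot check under stated assumptions: the curves are hypothetical power functions, $g$ is the OR-type function $\xi + \eta - \xi\eta$ (for which $L = 1$ is the claimed constant), and the integral is approximated by a midpoint rule.

```python
def substitute(g, f_a, f_b):
    """Substitution transformation: returns the curve p -> g(f_a(p), f_b(p))."""
    return lambda p: g(f_a(p), f_b(p))

def dist(f, h, m=1000):
    """Midpoint-rule approximation of the average metric distance."""
    return sum(abs(f((i + 0.5) / m) - h((i + 0.5) / m)) for i in range(m)) / m

g_or = lambda xi, eta: xi + eta - xi * eta

# Hypothetical ROC curves and perturbed versions of them.
f_a, f_a2 = (lambda p: p ** 0.5), (lambda p: p ** 0.6)
f_b, f_b2 = (lambda p: p ** 0.3), (lambda p: p ** 0.25)

# Left side: distance between the fused curves.
lhs = dist(substitute(g_or, f_a, f_b), substitute(g_or, f_a2, f_b2))
# Right side: L = 1 times the sum of the individual distances.
rhs = dist(f_a, f_a2) + dist(f_b, f_b2)
```

For these curves `lhs` does not exceed `rhs`, in line with the Lipschitz bound for the substitution transformation.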


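Theorem 3.8 Assume $T : (\mathfrak{R}^2, d^{(2)}_{\rho_1,1}) \to (\mathfrak{R}, d_{\rho_1,1})$ is a Lipschitz continuous mapping. Assume
$\{f_A^{(n)}\}, \{f_B^{(m)}\} \subset \mathfrak{R}$ converge in probability to $f_A$ and $f_B$, respectively; then $f_C^{(n,m)} = T(f_A^{(n)}, f_B^{(m)})$
converges in probability to $f_C = T(f_A, f_B)$.

Proof. Let $L > 0$ be the Lipschitz constant for $T$. For $\epsilon > 0$, there exists an $N_\epsilon$ such that
for all $n, m \ge N_\epsilon$,

$$\Pr\left\{\omega_A \in \Omega_A : d_{\rho_1,1}(f_A^{(n)}(\omega_A), f_A) > \frac{\epsilon}{2L}\right\} < \frac{\epsilon}{2}$$

and

$$\Pr\left\{\omega_B \in \Omega_B : d_{\rho_1,1}(f_B^{(m)}(\omega_B), f_B) > \frac{\epsilon}{2L}\right\} < \frac{\epsilon}{2}.$$

To prove $f_C^{(n,m)}$ converges in probability to $f_C$, the following must be shown

$$\Pr\left\{(\omega_A, \omega_B) \in \Omega_A \times \Omega_B : d_{\rho_1,1}(f_C^{(n,m)}(\omega_A, \omega_B), f_C) > \epsilon\right\} < \epsilon.$$

Consider the set

$$\left\{(\omega_A, \omega_B) \in \Omega_A \times \Omega_B : d_{\rho_1,1}(f_C^{(n,m)}(\omega_A, \omega_B), f_C) > \epsilon\right\}$$

and the relationship from Theorem 3.7 yields

$$d_{\rho_1,1}(f_C^{(n,m)}(\omega_A, \omega_B), f_C) \le L\, d_{\rho_1,1}(f_A^{(n)}(\omega_A), f_A) + L\, d_{\rho_1,1}(f_B^{(m)}(\omega_B), f_B).$$

It can be seen that

$$\begin{aligned}
&\left\{(\omega_A, \omega_B) \in \Omega_A \times \Omega_B : d_{\rho_1,1}(f_C^{(n,m)}(\omega_A, \omega_B), f_C) > \epsilon\right\} \\
&\quad \subseteq \left\{(\omega_A, \omega_B) \in \Omega_A \times \Omega_B : L\, d_{\rho_1,1}(f_A^{(n)}(\omega_A), f_A) + L\, d_{\rho_1,1}(f_B^{(m)}(\omega_B), f_B) > \epsilon\right\} \\
&\quad \subseteq \left\{(\omega_A, \omega_B) \in \Omega_A \times \Omega_B : d_{\rho_1,1}(f_A^{(n)}(\omega_A), f_A) > \frac{\epsilon}{2L}\right\} \\
&\qquad \cup \left\{(\omega_A, \omega_B) \in \Omega_A \times \Omega_B : d_{\rho_1,1}(f_B^{(m)}(\omega_B), f_B) > \frac{\epsilon}{2L}\right\}.
\end{aligned}$$

Now considering the probability measure of this event gives

$$\begin{aligned}
&\Pr\left\{(\omega_A, \omega_B) \in \Omega_A \times \Omega_B : d_{\rho_1,1}(f_C^{(n,m)}(\omega_A, \omega_B), f_C) > \epsilon\right\} \\
&\quad \le \Pr\left\{(\omega_A, \omega_B) \in \Omega_A \times \Omega_B : d_{\rho_1,1}(f_A^{(n)}(\omega_A), f_A) > \frac{\epsilon}{2L}\right\} \\
&\qquad + \Pr\left\{(\omega_A, \omega_B) \in \Omega_A \times \Omega_B : d_{\rho_1,1}(f_B^{(m)}(\omega_B), f_B) > \frac{\epsilon}{2L}\right\}.
\end{aligned}$$

By the independence assumption then,

$$\Pr\left\{(\omega_A, \omega_B) \in \Omega_A \times \Omega_B : d_{\rho_1,1}(f_A^{(n)}(\omega_A), f_A) > \frac{\epsilon}{2L}\right\} = \Pr\left\{\omega_A \in \Omega_A : d_{\rho_1,1}(f_A^{(n)}(\omega_A), f_A) > \frac{\epsilon}{2L}\right\}$$

and

$$\Pr\left\{(\omega_A, \omega_B) \in \Omega_A \times \Omega_B : d_{\rho_1,1}(f_B^{(m)}(\omega_B), f_B) > \frac{\epsilon}{2L}\right\} = \Pr\left\{\omega_B \in \Omega_B : d_{\rho_1,1}(f_B^{(m)}(\omega_B), f_B) > \frac{\epsilon}{2L}\right\}.$$

Therefore,

$$\begin{aligned}
&\Pr\left\{(\omega_A, \omega_B) \in \Omega_A \times \Omega_B : d_{\rho_1,1}(f_C^{(n,m)}(\omega_A, \omega_B), f_C) > \epsilon\right\} \\
&\quad \le \Pr\left\{\omega_A \in \Omega_A : d_{\rho_1,1}(f_A^{(n)}(\omega_A), f_A) > \frac{\epsilon}{2L}\right\} + \Pr\left\{\omega_B \in \Omega_B : d_{\rho_1,1}(f_B^{(m)}(\omega_B), f_B) > \frac{\epsilon}{2L}\right\} \\
&\quad < \frac{\epsilon}{2} + \frac{\epsilon}{2} = \epsilon
\end{aligned}$$

as desired, and $f_C^{(n,m)}$ converges in probability to $f_C$.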

This  is  an  important  result  that  covers  a  large  class  of  mappings.  Just  as  important  is  the 
framework  that  was  constructed.  This  framework  provides  a  valuable  tool  for  the  comparison  of 
ROC  curves  and  classifiers. 

3.4 Convergence Framework Application

Since the transformation for the within-OR ROC fusion (Equation 2.11) contains a maximization
and the transformation for the across-OR ROC fusion (Equation 2.10) contains a minimization,
it will be difficult to show that these are Lipschitz continuous mappings.

Consider first a simpler case, the ROC fusion function $g(\xi, \eta) = \xi + \eta - \xi\eta$ so that

$$f_C(p) = g(f_A(p), f_B(p)) = f_A(p) + f_B(p) - f_A(p) f_B(p).$$

Since $f_A$ and $f_B$ are, by definition, non-decreasing functions of $p$, then for $f_C$ evaluated at $r$

$$f_C(r) = f_A(r) + f_B(r) - f_A(r) f_B(r) \ge \max_{0 \le p \le r} \left\{ f_A(p) + f_B\!\left(\frac{r - p}{1 - p}\right) - f_A(p)\, f_B\!\left(\frac{r - p}{1 - p}\right) \right\}.$$

Therefore, $f_C$ is actually an upper bound for the fused OR curve. This is consistent with Clutz's
research. Clutz showed, for the within-OR fusion case, that a point on the fused ROC curve is given
by

$$(r, f_C(r)) = \big(p + q - pq,\ f_A(p) + f_B(q) - f_A(p) f_B(q)\big),$$

assuming that the classifiers are independent in their measurements [5]. Clutz states this is a weak
upper bound for the fused ROC curve.

It can be shown that the transformation yielding $f_C$ is a Lipschitz continuous mapping. The strategy will be to demonstrate
that for some $L > 0$,

$$d_{\rho_1,1}(f_C, f_{C'}) \le L\big[d_{\rho_1,1}(f_A, f_{A'}) + d_{\rho_1,1}(f_B, f_{B'})\big].$$

Consider the following. For any $a, a', b, b' \in [0, 1]$,

$$g(a, b) - g(a', b') = (a - a')(1 - b) + (b - b')(1 - a'),$$

so that $|g(a, b) - g(a', b')| \le |a - a'| + |b - b'|$, because $1 - b$ and $1 - a'$ lie in $[0, 1]$. Setting $a = f_A(p)$, $a' = f_{A'}(p)$, $b = f_B(p)$, $b' = f_{B'}(p)$ and integrating over $p \in [0, 1]$ gives

$$d_{\rho_1,1}(f_C, f_{C'}) \le d_{\rho_1,1}(f_A, f_{A'}) + d_{\rho_1,1}(f_B, f_{B'})$$

with $L = 1$, and as desired the transformation is Lipschitz continuous. Therefore, by Theorem 3.8,
if $f_A^{(n)}$ converges in probability to $f_A$ and $f_B^{(m)}$ converges in probability to $f_B$, then $f_C^{(n,m)}$ converges
in probability to $f_C$.

This example demonstrates how the convergence framework can be applied to a ROC fusion
mapping to show it is Lipschitz continuous and therefore convergent. This framework can be applied
to experimental data as well. The conjecture that within-OR and across-OR fusion are Lipschitz
continuous will be examined experimentally in the next chapter.
IV. Analysis and Findings


4.1 Overview

This chapter provides an experimental analysis of the convergence of a sequence of empirical
ROC curves. As the test sample size $n$ is increased, the convergence is measured using the numerical
approximation of the average metric distance. The average metric distances for the single-classifier
ROC curves as well as the fused-classifier ROC curves are calculated. The relationship between the
two will be examined.

In addition to varying sample size, the number of interpolation points, $m$, used to approximate
the ROC curve is also varied. In application, interpolation is used to construct a continuous ROC
curve from discrete sample data. The more interpolation points are used, the better this curve
approximates the actual empirical ROC curve. So just as increasing $n$ should make $f^{(n)}$ a better
estimate of $f$, increasing $m$ should make the interpolated curve a better approximation of the actual
curve. This approximation is sometimes overlooked when doing ROC analysis.

The  empirical  ROC  curves  in  this  experiment  are  constructed  both  parametrically  and  non- 
parametrically.  In  the  non-parametric  case,  the  frequency  of  false  positives  and  true  positives  is 
measured  at  discrete  threshold  values,  and  the  resultant  ordered  pairs  are  used  to  construct  the 
ROC  curve.  Put  more  simply,  the  number  of  false  positives  and  true  positives  will  be  counted  at 
each  n  for  each  threshold.  In  the  parametric  case,  the  underlying  statistical  distribution  of  the 
data  is  assumed,  or  in  the  case  of  this  experiment,  known.  The  target  and  non-target  sample  mean 
and  standard  deviation  will  be  calculated  for  each  n,  and  the  ROC  curve  will  be  constructed  by 
evaluating  the  cumulative  distribution  function  at  each  threshold.  This  process  is  repeated  for  five 
trials. 
4.2 Experimental Design


4.2.1 Classifiers.

This experiment examines the OR rule applied to two classifiers for both the within and across
fusion cases. Each classifier is a simple two-label classifier, declaring either target or non-target.
Classifier $A_\theta$ is defined on the feature set $X = \mathbb{R}$ for each $\theta \in \Theta = \mathbb{R}$ as:

$$A_\theta(x) = \begin{cases} \text{tar} & \text{if } x \le \theta \\ \text{non} & \text{if } x > \theta \end{cases}$$

Classifier $B_\phi$ is defined on the feature set $Y = \mathbb{R}$ for each $\phi \in \Phi = (0, \infty)$ as follows:

$$B_\phi(x) = \begin{cases} \text{tar} & \text{if } -1 - \phi \le x \le -1 + \phi \\ \text{non} & \text{otherwise} \end{cases}$$
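The two classifiers and the OR rule can be written down directly. This is a minimal sketch in Python rather than the MATLAB used in the experiment; the function names are hypothetical, and the choice of non-strict inequalities at the boundaries is an assumption (for continuous feature data it does not affect the probabilities).

```python
def classify_a(x, theta):
    """Classifier A_theta: 'tar' when x <= theta, otherwise 'non'."""
    return "tar" if x <= theta else "non"

def classify_b(x, phi):
    """Classifier B_phi: 'tar' when x lies within phi of -1, otherwise 'non'."""
    return "tar" if -1.0 - phi <= x <= -1.0 + phi else "non"

def classify_or(x, y, theta, phi):
    """OR rule: declare 'tar' when either classifier declares 'tar'."""
    labels = (classify_a(x, theta), classify_b(y, phi))
    return "tar" if "tar" in labels else "non"
```

For within fusion both classifiers see the same feature value, so `classify_or(x, x, theta, phi)` would be used; for across fusion `x` and `y` come from different feature sets.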


4.2.2 Feature Data.

Normally distributed feature data for both the target and non-target classes is generated using
the MATLAB normrnd function. One thousand samples are generated for each class. The first $n$
samples from each class are selected to constitute the sample sets $\mathcal{X}_{tar}^{(n)}$ and $\mathcal{X}_{non}^{(n)}$, with $\mathcal{X}^{(n)} = \mathcal{X}_{tar}^{(n)}
\cup \mathcal{X}_{non}^{(n)}$. This is done for $n = 10, 100, 200, 400$, and $1000$. Choosing the sets in this manner
ensures they are nested, that is,

$$\mathcal{X}^{(10)} \subset \mathcal{X}^{(100)} \subset \mathcal{X}^{(200)} \subset \mathcal{X}^{(400)} \subset \mathcal{X}^{(1000)}.$$


Although this nesting is not strictly necessary in a statistical sense, it is consistent with Alsing's
work [1].
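The nesting described above follows from drawing the largest sample once and taking prefixes of it. A minimal sketch, using Python's `random.gauss` as a stand-in for MATLAB's normrnd; the function name and seed are assumptions, and the example parameters are the within-fusion target distribution from Section 4.2.6.

```python
import random

def nested_samples(mu, sigma, sizes=(10, 100, 200, 400, 1000), seed=0):
    """Draw max(sizes) normal variates once; each sample set is a prefix of that draw."""
    rng = random.Random(seed)
    full = [rng.gauss(mu, sigma) for _ in range(max(sizes))]
    return {n: full[:n] for n in sizes}

# Within-fusion target class: mean -1, standard deviation 1/sqrt(2).
target = nested_samples(-1.0, 0.5 ** 0.5)
```

Because every set is a prefix of the same list, the set for each $n$ is contained in the set for every larger $n$.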


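4.2.3 ROC Estimates.

Classifiers $A_\theta$ and $B_\phi$ are applied to each $\mathcal{X}^{(n)}$ to generate $P_{FP}^{(n)}$ and $P_{TP}^{(n)}$ at each threshold, $\theta_i$
and $\phi_i$, respectively. For the non-parametric case, the number of true positives and false positives
is counted at each threshold and divided by $n$. For classifier $A_\theta$ for each $\theta \in \Theta$,

$$P_{TP}^{(n)}(\theta) = \frac{\#TP}{n} \quad \text{and} \quad P_{FP}^{(n)}(\theta) = \frac{\#FP}{n}.$$

More formally this can be expressed as

$$P_{TP}^{(n)}(\theta_i) = \frac{\mathrm{card}\{x_j \le \theta_i : x_j \in \mathcal{X}_{tar}^{(n)}\}}{n} \quad \text{for } i = 1, \ldots,$$

$$P_{FP}^{(n)}(\theta_i) = \frac{\mathrm{card}\{x_j \le \theta_i : x_j \in \mathcal{X}_{non}^{(n)}\}}{n} \quad \text{for } i = 1, \ldots,$$

where card is the cardinality of the set. For classifier $B_\phi$ for each $\phi \in \Phi$,

$$P_{TP}^{(n)}(\phi) = \frac{\#TP}{n} \quad \text{and} \quad P_{FP}^{(n)}(\phi) = \frac{\#FP}{n} \quad \text{for given } \phi.$$

More formally this can be expressed as

$$P_{TP}^{(n)}(\phi_i) = \frac{\mathrm{card}\{-1 - \phi_i \le x_j \le -1 + \phi_i : x_j \in \mathcal{X}_{tar}^{(n)}\}}{n} \quad \text{for } i = 1, \ldots,$$

$$P_{FP}^{(n)}(\phi_i) = \frac{\mathrm{card}\{-1 - \phi_i \le x_j \le -1 + \phi_i : x_j \in \mathcal{X}_{non}^{(n)}\}}{n} \quad \text{for } i = 1, \ldots.$$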
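The non-parametric estimates amount to counting, for each class, how many samples a classifier labels as target and dividing by the class sample size. A minimal sketch with hypothetical function names and tiny made-up samples, used for illustration only:

```python
def rates_a(theta, target, non_target):
    """Empirical (P_FP, P_TP) of A_theta: fraction of each class with x <= theta."""
    p_tp = sum(x <= theta for x in target) / len(target)
    p_fp = sum(x <= theta for x in non_target) / len(non_target)
    return p_fp, p_tp

def rates_b(phi, target, non_target):
    """Empirical (P_FP, P_TP) of B_phi: fraction of each class within phi of -1."""
    inside = lambda x: -1.0 - phi <= x <= -1.0 + phi
    p_tp = sum(inside(x) for x in target) / len(target)
    p_fp = sum(inside(x) for x in non_target) / len(non_target)
    return p_fp, p_tp

# Tiny hypothetical samples, not data from the experiment.
target = [-2.0, -1.5, -1.0, -0.5]
non_target = [-0.5, 0.5, 1.0, 2.0]
point_a = rates_a(-1.0, target, non_target)
point_b = rates_b(0.5, target, non_target)
```

With these samples, three of the four target values fall at or below $\theta = -1$ and none of the non-target values do, so `point_a` is `(0.0, 0.75)`.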

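For the parametric case, the target sample mean $\mu_{tar}^{(n)}$ and standard deviation $\sigma_{tar}^{(n)}$ are calculated
for each $\mathcal{X}^{(n)}$ as well as the non-target sample mean $\mu_{non}^{(n)}$ and standard deviation $\sigma_{non}^{(n)}$. These are
used in the evaluation of the MATLAB normcdf function at each threshold. For classifier $A_\theta$ for
each $\theta$,

$$P_{TP}^{(n)}(\theta) = \mathrm{normcdf}(\theta, \mu_{tar}^{(n)}, \sigma_{tar}^{(n)}),$$

$$P_{FP}^{(n)}(\theta) = \mathrm{normcdf}(\theta, \mu_{non}^{(n)}, \sigma_{non}^{(n)}).$$

For classifier $B_\phi$ for each $\phi$,

$$P_{TP}^{(n)}(\phi) = \mathrm{normcdf}(-1 + \phi, \mu_{tar}^{(n)}, \sigma_{tar}^{(n)}) - \mathrm{normcdf}(-1 - \phi, \mu_{tar}^{(n)}, \sigma_{tar}^{(n)}),$$

$$P_{FP}^{(n)}(\phi) = \mathrm{normcdf}(-1 + \phi, \mu_{non}^{(n)}, \sigma_{non}^{(n)}) - \mathrm{normcdf}(-1 - \phi, \mu_{non}^{(n)}, \sigma_{non}^{(n)}).$$

Finally, the true probabilities $P_{FP}(\theta)$ and $P_{TP}(\theta)$ are calculated for each $\theta$ and $P_{FP}(\phi)$ and
$P_{TP}(\phi)$ are calculated for each $\phi$. Similar to the parametric case, these are calculated with the
normcdf function, this time using the true target and non-target means and standard deviations.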
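The parametric estimates can be sketched with Python's `statistics.NormalDist`, whose `cdf` method plays the role of MATLAB's normcdf here. The function names and the small samples are hypothetical; `statistics.stdev` is the sample standard deviation with the $n - 1$ denominator.

```python
from statistics import NormalDist, mean, stdev

def parametric_rates_a(theta, target, non_target):
    """(P_FP, P_TP) of A_theta from normal fits to each class sample."""
    tar = NormalDist(mean(target), stdev(target))
    non = NormalDist(mean(non_target), stdev(non_target))
    return non.cdf(theta), tar.cdf(theta)

def parametric_rates_b(phi, target, non_target):
    """(P_FP, P_TP) of B_phi: fitted probability mass on [-1 - phi, -1 + phi]."""
    tar = NormalDist(mean(target), stdev(target))
    non = NormalDist(mean(non_target), stdev(non_target))
    p_tp = tar.cdf(-1.0 + phi) - tar.cdf(-1.0 - phi)
    p_fp = non.cdf(-1.0 + phi) - non.cdf(-1.0 - phi)
    return p_fp, p_tp

# Hypothetical samples with sample mean -1 (target) and 1 (non-target), both std 1.
target = [-2.0, -1.0, 0.0]
non_target = [0.0, 1.0, 2.0]
fp_a, tp_a = parametric_rates_a(-1.0, target, non_target)
fp_b, tp_b = parametric_rates_b(1.0, target, non_target)
```

Replacing the fitted means and standard deviations with the true distribution parameters gives the true probabilities in the same way.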
4.2.4 Empirical ROC Curve Construction.

Using the estimates from Section 4.2.3, ROC curves for each classifier for each $n$ can be constructed.
Since there are a finite number of threshold values and sample feature data, it is possible
that the ROC trajectory for multiple threshold values could project to the same $P_{FP}$ value. In this
case, the ROC curve is constructed using the highest $P_{TP}$ value associated with that $P_{FP}$ value.
Recall this conforms to Oxley and Bauer's definition of a ROC curve (see Equations (2.6) and (2.7)).
The empirical ROC curve for classifier family $A$ then is defined to be

$$f_A^{(n)}(p) = \max\{P_{TP}^{(n)}(A_\theta) : \theta \in \Theta \text{ and } P_{FP}^{(n)}(A_\theta) = p\}.$$

Similarly, the empirical ROC curve for classifier family $B$ is defined to be

$$f_B^{(n)}(q) = \max\{P_{TP}^{(n)}(B_\phi) : \phi \in \Phi \text{ and } P_{FP}^{(n)}(B_\phi) = q\}.$$

Finally, the empirical fused-OR ROC curve for classifier family $C$ is defined to be

$$f_C^{(n)}(r) = \max\{P_{TP}^{(n)}(C_{\theta,\phi}) : (\theta, \phi) \in \Theta \times \Phi \text{ and } P_{FP}^{(n)}(C_{\theta,\phi}) = r\},$$

and is calculated using Oxley and Bauer's ROC fusion formulas (see Equations (2.10) and (2.11)).

The $(P_{FP}, P_{TP})$ values are interpolated with the MATLAB interp1 function, which performs
a linear interpolation. To measure the effect of the number of interpolation
points on the average metric distance, $m = 11, 31, 101$, and $501$ interpolates are used.
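The interpolation step can be sketched as a piecewise-linear evaluation at $m$ equally spaced points. This is an illustrative stand-in for interp1, not a reimplementation of it: the function names are hypothetical, the equally spaced grid is an assumption, and values outside the data range are clamped to the end points here.

```python
def interp_linear(xs, ys, x):
    """Piecewise-linear interpolation through (xs, ys); xs strictly increasing."""
    if x <= xs[0]:
        return ys[0]
    for x0, y0, x1, y1 in zip(xs, ys, xs[1:], ys[1:]):
        if x <= x1:
            return y0 + (y1 - y0) * (x - x0) / (x1 - x0)
    return ys[-1]

def resample(xs, ys, m):
    """Evaluate the interpolant at m equally spaced points on [0, 1]."""
    grid = [i / (m - 1) for i in range(m)]
    return grid, [interp_linear(xs, ys, p) for p in grid]

# Hypothetical empirical ROC points (P_FP, P_TP) after the max rule is applied.
xs, ys = [0.0, 0.2, 0.6, 1.0], [0.0, 0.7, 0.9, 1.0]
grid, values = resample(xs, ys, 11)
```

With few interpolation points the resampled curve can miss corners of the empirical ROC curve, which is the effect the varying $m$ in the experiment is meant to expose.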

4.2.5 Test Metric.

Recall from Definition 3.3 that average metric distance can be defined over $p$ rather than $\theta$. The
numerical approximation for average metric distance then is

$$d_{\rho_1,1}(f^{(n)}, f) \approx \frac{\sum_{i=1}^{m} \rho_1\big(f^{(n)}(p_i), f(p_i)\big)}{m}. \tag{4.1}$$

For the purposes of this experiment, the $d_{\rho_1,1}(f^{(n)}, f)$ metric is always calculated with $m = 501$.
This is done in order to ensure that the average metric distances calculated for varying $m$ can be
consistently compared to each other. If this is not done, an $f^{(n)}$ approximated with $m = 11$
interpolates would only be summed over 11 points in the average metric distance calculation, while
an $f^{(n)}$ approximated with $m = 31$ interpolates would be summed over 31 points. To be consistent,
an $f^{(n)}$ constructed with 11, 31, or 101 interpolates will always be evaluated at 501 points along the
curve in the average metric distance calculation. Likewise, the true ROC curve $f$ is constructed
with $m = 501$ interpolates. All estimated ROC curves, regardless of the $m$ used to approximate
them, are compared to this true ROC curve. The average metric distance for each ROC curve,
$d_{\rho_1,1}(f_A^{(n)}, f_A)$, $d_{\rho_1,1}(f_B^{(n)}, f_B)$, and $d_{\rho_1,1}(f_C^{(n)}, f_C)$, is calculated for all $n$ and $m$.
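Equation (4.1) is a plain average of pointwise absolute differences over the evaluation points. A minimal sketch, assuming the 501 points are equally spaced on $[0, 1]$ (the spacing is an assumption); the curves are hypothetical and chosen so the result is easy to reason about.

```python
import math

def approx_distance(f_hat, f_true, m=501):
    """Mean of |f_hat(p_i) - f_true(p_i)| over m equally spaced points on [0, 1]."""
    grid = [i / (m - 1) for i in range(m)]
    return sum(abs(f_hat(p) - f_true(p)) for p in grid) / m

f_true = lambda p: math.sqrt(p)                     # hypothetical true ROC curve
f_hat = lambda p: min(1.0, math.sqrt(p) + 0.01)     # estimate offset by 0.01, capped at 1
d = approx_distance(f_hat, f_true)
```

Because the estimate sits 0.01 above the true curve except where it is capped at 1, the computed distance is slightly below 0.01.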


4.2.6 Within Fusion.

In the within fusion case, classifiers $A_\theta$ and $B_\phi$ act on the same feature set. Target feature
data is drawn from a normal distribution with $\mu = -1$ and $\sigma = 1/\sqrt{2}$. Non-target feature data is
drawn from a normal distribution with $\mu = 1$ and $\sigma = 1$. Recall the definition of classifier $A_\theta$:

$$A_\theta(x) = \begin{cases} \text{tar} & \text{if } x \le \theta \\ \text{non} & \text{if } x > \theta \end{cases}$$

Figure 4.1 depicts this classifier. The threshold values for $A_\theta$ are $\theta \in \{-4, -3.9, -3.8, \ldots, 4\}$.


Figure 4.1. Classifier $A_\theta$ within fusion case.

Recall classifier $B_\phi$:

$$B_\phi(x) = \begin{cases} \text{tar} & \text{if } -1 - \phi \le x \le -1 + \phi \\ \text{non} & \text{otherwise} \end{cases}$$

Figure 4.2 depicts this classifier. The threshold values for $B_\phi$ are $\phi \in \{0, 0.05, 0.10, \ldots, 5\}$.


Figure 4.2. Classifier $B_\phi$ within fusion case.

The ROC fusion for $A_\theta \vee B_\phi$ is done numerically using Oxley and Bauer's MATLAB code for within-OR
fusion. Both the parametric and non-parametric experiments use the same feature data and
threshold values.

4.2.7 Across Fusion.

In across fusion, classifiers $A_\theta$ and $B_\phi$ act on different sets of feature data. For this experiment,
classifier $A_\theta$ will use the same target and non-target feature data sets from the within case. Classifier
$B_\phi$ acts on new feature set $Y = \mathbb{R}$, with target data drawn from a normal distribution with $\mu = -1$
and $\sigma = 1$ and non-target feature data drawn from a normal distribution with $\mu = 1$ and $\sigma = 2$.
Figure 4.3 depicts this. Threshold sets are the same as the within fusion case, specifically $\theta \in
\{-4, -3.9, -3.8, \ldots, 4\}$ and $\phi \in \{0, 0.05, 0.10, \ldots, 5\}$.

4.3 Results


4.3.1 Within-OR (Non-parametric).

Tables 4.1, 4.2, and 4.3 provide the results for the $d_{\rho_1,1}(f_A^{(n)}, f_A)$, $d_{\rho_1,1}(f_B^{(n)}, f_B)$, and
$d_{\rho_1,1}(f_C^{(n)}, f_C)$ metrics. The tables show the averages over 5 trials. Looking down the columns,
as expected, the average metric distance decreases as $n$ increases, with the exception being the
$m = 11$ column. This column can be subject to additional variability due to the potentially poor
approximation with such a small number of interpolates. The trend down the other columns is
fairly stable. Looking across the rows, next, there is a general downward trend in values from
the $m = 11$ column to the $m = 31$ column. The average metric distance values seem to stabilize for
$m = 31, 101, 501$ though. This could be explained by the fact that classifiers $A_\theta$ and $B_\phi$ only evaluate
at 81 and 101 threshold values, respectively. So the rule of thumb from this example would appear
to be to use a number of interpolates similar to the number of threshold values. Anything less may
add variability to the results; anything more may yield diminishing returns. Finally, it is interesting
to note that the average metric distance for $f_C$ is on the order of that for $f_A$ and $f_B$. In fact, it is
even slightly lower.

Figure  4.4  displays  the  results  graphically.  The  plots  displayed  are  for  the  m  =  31  case.  The 
true  ROC  curve  and  empirical  ROC  curve  are  plotted  for  A,  B,  and  C.  While  there  is  certainly 
some difference visually at $n = 10$, by $n = 100$ the empirical ROC curves fall almost on
top of their respective true ROC curves. For $n = 200, 400, 1000$, they are barely distinguishable.

To put these results into perspective and gain an intuitive feel for these values, consider first
that the numerical approximation for the average metric distance (see Equation (4.1)) is very close
to a midpoint quadrature rule for numerical integration. So the average metric distance can also
be thought of as the area of the difference between $f^{(n)}$ and $f$. Now consider that an average metric
distance of 1 is the maximum possible difference between two ROC curves and a value of 0 means
the two ROC curves are equivalent. Furthermore, the average metric distance between the chance
line and the ideal ROC curve is 0.5. So in the $n = 1000$ case, for all three ROC curves, there is
approximately a 0.5% difference in area. Even at $n = 100$, the difference is less than 2%, suggesting
that one could construct fairly tight confidence intervals around the curve for relatively small sample
sizes.
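The two reference values quoted above follow directly from the definition of the average metric distance. As a short worked check, take the ideal ROC curve to be $f_{ideal}(p) = 1$, the chance line to be $f_{chance}(p) = p$, and the two extreme constant curves $f_1(p) = 1$ and $f_0(p) = 0$ (these symbols are introduced here only for illustration):

```latex
d_{\rho_1,1}(f_{ideal}, f_{chance}) = \int_0^1 |1 - p| \, dp
  = \left[ p - \tfrac{p^2}{2} \right]_0^1 = \tfrac{1}{2},
\qquad
d_{\rho_1,1}(f_1, f_0) = \int_0^1 |1 - 0| \, dp = 1 .
```

Against this scale, a distance of roughly 0.005 corresponds to about 1% of the gap between the chance line and the ideal curve.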


Table 4.1. Within-OR (non-parametric): average metric distances for $f_A^{(n)}$.

n\m      11        31        101       501
10       0.0461    0.0461    0.0461    0.0461
100      0.0201    0.0179    0.0180    0.0180
200      0.0219    0.0122    0.0120    0.0120
400      0.0231    0.0095    0.0080    0.0082
1000     0.0200    0.0057    0.0042    0.0044

Table 4.2. Within-OR (non-parametric): average metric distances for $f_B^{(n)}$.

n\m      11        31        101       501
10       0.0604    0.0604    0.0604    0.0604
100      0.0147    0.0173    0.0180    0.0180
200      0.0144    0.0169    0.0171    0.0170
400      0.0153    0.0121    0.0124    0.0126
1000     0.0111    0.0064    0.0066    0.0067

Table 4.3. Within-OR (non-parametric): average metric distances for $f_C^{(n)}$.

n\m      11        31        101       501
10       0.0353    0.0353    0.0353    0.0353
100      0.0158    0.0156    0.0165    0.0165
200      0.0182    0.0107    0.0115    0.0115
400      0.0229    0.0086    0.0070    0.0071
1000     0.0197    0.0046    0.0035    0.0036

4.3.2 Across-OR (Non-parametric).

Tables 4.4, 4.5, and 4.6 provide the results for the $d_{\rho_1,1}(f_A^{(n)}, f_A)$, $d_{\rho_1,1}(f_B^{(n)}, f_B)$, and
$d_{\rho_1,1}(f_C^{(n)}, f_C)$ metrics. The across-OR experiment yields generally similar results to the within-OR
case. The average metric distance values are on par with the within case, and the trends are similar
with respect to increasing $n$ and $m$. Notice one interesting difference in Table 4.5 for the $m = 11$
column though. For $n = 100$, $200$, and $1000$, the approximation at $m = 11$ is better than the approximation at $m = 31, 101, 501$.
So the rule discussed previously does not appear to be hard and fast, although it still probably is a
best practice to choose $m$ based on the number of threshold values. Another difference is that the
average metric distance for $f_C$ is not smaller than that for both $f_A$ and $f_B$; however, it is still on the same
order. Figure 4.5 displays this graphically. Once again the convergence visually appears to be very
fast.


Table 4.4. Across-OR (non-parametric): average metric distances for $f_A^{(n)}$.

n\m      11        31        101       501
10       0.0461    0.0461    0.0461    0.0461
100      0.0201    0.0179    0.0180    0.0180
200      0.0219    0.0122    0.0120    0.0120
400      0.0231    0.0095    0.0080    0.0082
1000     0.0200    0.0057    0.0042    0.0044

Table 4.5. Across-OR (non-parametric): average metric distances for $f_B^{(n)}$.

n\m     11      31      101     501
10      0.1226  0.1226  0.1226  0.1226
100     0.0220  0.0272  0.0282  0.0282
200     0.0162  0.0167  0.0174  0.0175
400     0.0174  0.0165  0.0164  0.0166
1000    0.0087  0.0093  0.0092  0.0093
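The observation about the m = 11 column can be checked directly against the tabulated values. The following is a minimal sketch that copies the entries of Table 4.5 and reports, for each n, the m with the smallest average metric distance; the helper name `best_m` and the tie-breaking rule (first column wins) are illustrative choices, not part of the original experiment.

```python
# Average metric distances copied from Table 4.5 (rows: n, columns: m).
M_VALUES = [11, 31, 101, 501]
TABLE_4_5 = {
    10: [0.1226, 0.1226, 0.1226, 0.1226],
    100: [0.0220, 0.0272, 0.0282, 0.0282],
    200: [0.0162, 0.0167, 0.0174, 0.0175],
    400: [0.0174, 0.0165, 0.0164, 0.0166],
    1000: [0.0087, 0.0093, 0.0092, 0.0093],
}


def best_m(row):
    """Return the m with the smallest distance (the first one wins on ties)."""
    smallest = min(row)
    return M_VALUES[row.index(smallest)]


best_by_n = {n: best_m(row) for n, row in TABLE_4_5.items()}

for n, m in best_by_n.items():
    print(f"n = {n:4d}: smallest average distance at m = {m}")
```

For n = 10 all four columns tie, and n = 400 is the one sample size at which m = 11 is not the best column.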

Table 4.6. Across-OR (non-parametric): average metric distances for $f_C^{(n)}$.

n\m     11      31      101     501
10      0.0660  0.0665  0.0669  0.0671
100     0.0168  0.0194  0.0207  0.0208
200     0.0193  0.0140  0.0144  0.0146
400     0.0210  0.0120  0.0110  0.0113
1000    0.0162  0.0062  0.0054  0.0055
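One rough way to compare the fused distances in Table 4.6 with the individual distances in Tables 4.4 and 4.5 is to form the ratio of the fused distance to the sum of the two individual distances. This ratio is a hypothetical diagnostic introduced here for illustration; it is not necessarily how the thesis measures any constant. The sketch uses the m = 101 column of each table.

```python
# Average metric distances at m = 101, copied from Tables 4.4, 4.5, and 4.6.
N_VALUES = [10, 100, 200, 400, 1000]
DIST_A = [0.0461, 0.0180, 0.0120, 0.0080, 0.0042]  # Table 4.4
DIST_B = [0.1226, 0.0282, 0.0174, 0.0164, 0.0092]  # Table 4.5
DIST_C = [0.0669, 0.0207, 0.0144, 0.0110, 0.0054]  # Table 4.6 (fused curve)


def observed_ratio(d_a, d_b, d_c):
    """Assumed diagnostic: fused distance over the sum of the input distances."""
    return d_c / (d_a + d_b)


ratios = [observed_ratio(a, b, c) for a, b, c in zip(DIST_A, DIST_B, DIST_C)]

for n, r in zip(N_VALUES, ratios):
    print(f"n = {n:4d}: ratio = {r:.3f}")
```

With these entries the ratio stays below one at every sample size, which matches the statement that the fused distance is of the same order as the individual distances.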

4.3.3 Within-OR (Parametric).

Tables 4.7, 4.8, and 4.9 provide comparisons of the $d_{p_1,1}(f_A^{(n)}, f_A)$, $d_{p_1,1}(f_B^{(n)}, f_B)$, and
$d_{p_1,1}(f_C^{(n)}, f_C)$ metrics for the parametric and non-parametric cases. The tables show the averages
over 5 trials. For this case, only m = 31 interpolates were calculated. The convergence results for
the parametric case are comparable to the results for the non-parametric case. Figure 4.6 displays
this graphically. Aesthetically, the parametric ROC curves look much better, but from this data one
cannot conclude that there is a performance advantage over the non-parametric case. The parametric
method is certainly easier to compute, so that may be a consideration for an application where
computing power is at a premium.
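To illustrate the difference between the two kinds of estimate, the following sketch computes one point of a ROC curve both ways from simulated scores. It assumes a binormal model (each class's scores are normally distributed) for the parametric estimate; the parametric model actually used in the experiment is not restated here, so the binormal choice, the function names, and the simulated data are all assumptions made for illustration.

```python
import random
from statistics import NormalDist, mean, stdev


def empirical_roc(neg_scores, pos_scores, p_fa):
    """Non-parametric estimate of the detection rate at false alarm rate p_fa."""
    ranked = sorted(neg_scores, reverse=True)
    allowed = int(p_fa * len(ranked))      # number of false alarms permitted
    if allowed >= len(ranked):
        return 1.0
    threshold = ranked[allowed]            # scores above this are declared targets
    return sum(s > threshold for s in pos_scores) / len(pos_scores)


def binormal_roc(neg_scores, pos_scores, p_fa):
    """Parametric estimate: fit a normal distribution to each class (0 < p_fa < 1)."""
    neg = NormalDist(mean(neg_scores), stdev(neg_scores))
    pos = NormalDist(mean(pos_scores), stdev(pos_scores))
    threshold = neg.inv_cdf(1.0 - p_fa)
    return 1.0 - pos.cdf(threshold)


# Simulated scores: non-targets ~ N(0, 1), targets ~ N(1.5, 1).
rng = random.Random(0)
neg = [rng.gauss(0.0, 1.0) for _ in range(1000)]
pos = [rng.gauss(1.5, 1.0) for _ in range(1000)]

# True detection rate at p_fa = 0.1 under the simulated model.
truth = 1.0 - NormalDist(1.5, 1.0).cdf(NormalDist(0.0, 1.0).inv_cdf(0.9))
pd_empirical = empirical_roc(neg, pos, 0.1)
pd_parametric = binormal_roc(neg, pos, 0.1)

print(f"true {truth:.3f}  empirical {pd_empirical:.3f}  parametric {pd_parametric:.3f}")
```

The parametric estimate needs only two means and two standard deviations, whereas the empirical estimate sorts and counts the raw scores at every evaluation.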


Table 4.7. Parametric vs. non-parametric: average metric distances for $f_A^{(n)}$.

        parametric  non-parametric
n\m     31          31
10      0.0483      0.0461
100     0.0129      0.0179
200     0.0099      0.0122
400     0.0080      0.0095
1000    0.0061      0.0057

Table 4.8. Parametric vs. non-parametric: average metric distances for $f_B^{(n)}$.

        parametric  non-parametric
n\m     31          31
10      0.0483      0.0604
100     0.0113      0.0173
200     0.0098      0.0169
400     0.0070      0.0121
1000    0.0046      0.0064

Table 4.9. Parametric vs. non-parametric: average metric distances for $f_C^{(n)}$.

        parametric  non-parametric
n\m     31          31
10      0.0426      0.0353
100     0.0114      0.0156
200     0.0095      0.0107
400     0.0072      0.0086
1000    0.0060      0.0046

4.3.4 Convergence as a Function of Sample Size.

Figures 4.7, 4.8, and 4.9 display the convergence as a function of sample size for the within-OR
fusion (non-parametric and parametric) and the across-OR fusion cases. The convergence rate
appears to be of the order $1/\sqrt{n}$. This is consistent with Tchebysheff's Theorem [23]. Applying
Tchebysheff's Theorem to this analysis, for a positive constant k at a given p,

$$\Pr\left( \left| f^{(n)}(p) - f(p) \right| > k \frac{\sigma}{\sqrt{n}} \right) \le \frac{1}{k^2},$$

where σ is the standard deviation of the random variable $f^{(n)}(p)$. So Tchebysheff's Theorem says that,
at a given confidence level, $f^{(n)}(p)$ should become a better estimate of $f(p)$ as a function of $1/\sqrt{n}$,
consistent with the findings of the experiment. The figures also provide more visual evidence that
the average metric distances for the individual ROC curves and the fused ROC curves are of the
same order. This is a good result, since it implies there is no penalty to the rate of convergence
for fused ROC curves. Finally, observe that this experiment supports the conjecture that the
fused empirical ROC curve converges for within-OR and across-OR fusion. Furthermore, for this
experiment, the observed Lipschitz constant is less than one, and in fact is very close to 0.5.
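The $1/\sqrt{n}$ behaviour can be illustrated with a small simulation. The sketch below is not the experiment from this chapter: it treats a single ROC point as a detection rate estimated from n independent trials, takes σ to be the per-sample standard deviation, and counts how often the estimate misses the true value by more than $k\sigma/\sqrt{n}$. The true rate of 0.8, k = 2, and the number of repetitions are arbitrary illustrative choices.

```python
import random


def exceed_fraction(true_rate, n, k, trials, rng):
    """Return the bound k*sigma/sqrt(n) and the fraction of trials exceeding it."""
    sigma = (true_rate * (1.0 - true_rate)) ** 0.5   # per-sample std deviation
    bound = k * sigma / n ** 0.5
    misses = 0
    for _ in range(trials):
        hits = sum(rng.random() < true_rate for _ in range(n))
        if abs(hits / n - true_rate) > bound:
            misses += 1
    return bound, misses / trials


rng = random.Random(0)
results = {n: exceed_fraction(0.8, n, 2.0, 500, rng) for n in (10, 100, 1000)}

for n, (bound, frac) in results.items():
    print(f"n = {n:4d}: bound = {bound:.4f}, fraction exceeding = {frac:.3f}")
```

The bound shrinks by a factor of $\sqrt{10}$ each time n grows tenfold, while the fraction of estimates outside the bound stays below the Tchebysheff limit $1/k^2 = 0.25$.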


Figure  4.7.  Average  metric  distance  as  a  function  of  sample  size  -  within-OR  (non-parametric  case) 


Figure  4.8.  Average  metric  distance  as  a  function  of  sample  size  -  within-OR  (parametric  case) 


Figure  4.9.  Average  metric  distance  as  a  function  of  sample  size  -  across-OR  (non-parametric  case) 
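The apparent order of convergence can also be estimated numerically from the tabulated averages by fitting a straight line to the logarithm of the distance against the logarithm of n. The sketch below uses the m = 101 column of Table 4.3; the choice of column and the least-squares fit are illustrative assumptions and are not taken from the figures.

```python
import math

# Within-OR (non-parametric) averages for the fused curve, Table 4.3, m = 101.
N_VALUES = [10, 100, 200, 400, 1000]
DISTANCES = [0.0353, 0.0165, 0.0115, 0.0070, 0.0035]


def loglog_slope(xs, ys):
    """Least-squares slope of log(y) against log(x)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mean_x = sum(lx) / len(lx)
    mean_y = sum(ly) / len(ly)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    sxx = sum((a - mean_x) ** 2 for a in lx)
    return sxy / sxx


slope = loglog_slope(N_VALUES, DISTANCES)
print(f"fitted exponent: {slope:.2f}")
```

A fitted exponent near -0.5 corresponds to a distance that shrinks like $1/\sqrt{n}$.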


V.  Summary  and  Recommendations 


5.1  Summary  of  Contributions 

The primary contribution of this thesis is a proof of convergence for fused empirical ROC curves.
It proves that if the individual empirical ROC curves converge and the ROC fusion is accomplished
with a Lipschitz continuous transformation (a large class of transformations), then the convergence
of the fused ROC curve is guaranteed. This is an important contribution since it establishes that
fused empirical ROC curves are consistent estimators of the true fused ROC curve, and therefore
comparisons with these curves are valid. This thesis also provides a mathematical framework to
prove this convergence. This framework is applied to a generic ROC fusion transformation, as well
as to the OR rule fusion transformation.
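The shape of the argument can be sketched as follows. This is a simplified outline under assumed notation, not the formal statement proved in the thesis: $d$ denotes a metric on ROC curves, $F$ the ROC fusion transformation, and $L > 0$ a Lipschitz constant in the sense of the first inequality.

```latex
% Assumed Lipschitz condition on the fusion transformation F
d\bigl(F(f_A^{(n)}, f_B^{(n)}),\, F(f_A, f_B)\bigr)
  \le L \left[ d(f_A^{(n)}, f_A) + d(f_B^{(n)}, f_B) \right].

% Hence, for any \varepsilon > 0,
\Pr\Bigl( d\bigl(F(f_A^{(n)}, f_B^{(n)}),\, F(f_A, f_B)\bigr) > \varepsilon \Bigr)
  \le \Pr\Bigl( d(f_A^{(n)}, f_A) > \tfrac{\varepsilon}{2L} \Bigr)
    + \Pr\Bigl( d(f_B^{(n)}, f_B) > \tfrac{\varepsilon}{2L} \Bigr)
  \longrightarrow 0 \quad \text{as } n \to \infty .
```

The second line holds because the sum of two distances can exceed $\varepsilon / L$ only if at least one of them exceeds $\varepsilon / (2L)$, and each term on the right tends to zero when the individual empirical ROC curves converge in probability.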

The experiment using the OR rule provides valuable insight into how this framework can be
applied. The results of the experiment indicate that the convergence of the fused ROC curve is
of the same order as that of the individual ROC curves, so fusion carries no apparent penalty in
convergence rate. The utility of this framework also extends beyond establishing convergence: it
could be used for comparing any two ROC curves. Finally, the interpolation used to construct the
empirical ROC curve should be a consideration when doing ROC analysis.
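As a reminder of what the OR rule does at a single operating point, the following sketch fuses two (probability of false alarm, probability of detection) pairs under the assumption that the two classifiers decide independently. It works on individual points only; the ROC fusion transformation studied in this thesis acts on whole curves, and the function name and example values are illustrative.

```python
def or_fuse(point_a, point_b):
    """OR-fuse two operating points (p_fa, p_d) of independent classifiers.

    The fused system declares a target when either classifier does, so it
    stays silent only when both classifiers stay silent.
    """
    pfa_a, pd_a = point_a
    pfa_b, pd_b = point_b
    p_fa = 1.0 - (1.0 - pfa_a) * (1.0 - pfa_b)
    p_d = 1.0 - (1.0 - pd_a) * (1.0 - pd_b)
    return p_fa, p_d


fused_pfa, fused_pd = or_fuse((0.10, 0.70), (0.05, 0.60))
print(f"fused p_fa = {fused_pfa:.3f}, fused p_d = {fused_pd:.3f}")
```

OR fusion raises both the detection probability and the false alarm probability relative to either classifier alone.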

5.2  Recommendations  for  Future  Research 

There are several candidates for continuing research in this area. There are potentially many
other ROC fusion transformations to which this framework could be applied. In particular, the
convergence under AND fusion, for which a formula is available, could be proven with these results.
This could also be extended to the majority vote, and potentially to other fusers, such as artificial
neural nets. Although it was conjectured, and supported experimentally, that OR fusion is Lipschitz
continuous, a formal convergence theorem could still be developed. Finally, this work assumed
statistical independence of classifiers, which may not be a good assumption for many multiple
classifier systems. There has been much effort in studying the effects of correlated classifiers on ROC
analysis. Convergence of fused correlated classifiers would be a rich area for study.
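For the AND fusion mentioned above, a pointwise sketch under the same independence assumption is given below, together with a brute-force check that the product map is Lipschitz with constant at most one on a grid of probabilities. This is only a numerical illustration at the level of single operating points, with hypothetical function names; it is not a proof of convergence for fused curves.

```python
def and_fuse(point_a, point_b):
    """AND-fuse two operating points (p_fa, p_d) of independent classifiers."""
    pfa_a, pd_a = point_a
    pfa_b, pd_b = point_b
    return pfa_a * pfa_b, pd_a * pd_b


def max_and_ratio(steps=10):
    """Largest |a*b - c*d| / (|a - c| + |b - d|) over a grid in [0, 1]."""
    grid = [i / steps for i in range(steps + 1)]
    worst = 0.0
    for a in grid:
        for b in grid:
            for c in grid:
                for d in grid:
                    gap = abs(a - c) + abs(b - d)
                    if gap > 0:
                        worst = max(worst, abs(a * b - c * d) / gap)
    return worst


and_pfa, and_pd = and_fuse((0.10, 0.70), (0.05, 0.60))
print(f"fused p_fa = {and_pfa:.3f}, fused p_d = {and_pd:.3f}")
print(f"largest observed ratio: {max_and_ratio():.3f}")
```

The ratio never exceeds one because $|ab - cd| \le |a - c|\,b + c\,|b - d|$ for probabilities in $[0, 1]$.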